Thera Bank recently saw a steep decline in the number of users of its credit card. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service would lead the bank to losses, so the bank wants to analyze customer data to identify the customers who are likely to leave the service and the reasons why, so that it can improve in those areas.
The objective is to analyze the customers' data and information to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
Following are the key questions to be solved:
The records contain the customers' personal and demographic information along with their card details, account relationship, and transaction patterns over the past 12 months.
The detailed data dictionary is given below:
Customer Details
# Importing the Python Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
from IPython.display import Image
# Importing libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import PercentFormatter
# To suppress warnings from being displayed in the output
import warnings
warnings.filterwarnings("ignore")
# To be used for missing value imputation
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
# this will help in making the Python code more structured automatically (good coding practice)
!pip install nb-black
%reload_ext nb_black
# Command to tell Python to actually display the graphs
%matplotlib inline
# let's start by installing plotly
!pip install plotly
# importing plotly
import plotly.express as px
# Command to hide the 'already satisfied' messages from being displayed
%pip install keras | grep -v 'already satisfied'
# Constant for making bold text
boldText = "\033[1m"
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 500)
# to split the data into train and test
from sklearn.model_selection import train_test_split
# to build a linear regression model
from sklearn.linear_model import LinearRegression
# to build Bagging model
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
# to build Boosting model
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
pd.set_option("mode.chained_assignment", None)
# To build model for prediction
from sklearn.linear_model import LogisticRegression
# To get different metric scores
# To tune different models
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn import metrics
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
plot_confusion_matrix,
precision_recall_curve,
roc_curve,
make_scorer,
)
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
plot_confusion_matrix,
)
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# To be used for creating pipelines and personalizing them
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# Install library using
# In jupyter notebook
# !pip install shap
# or
# In anaconda command prompt
# conda install -c conda-forge shap - in conda prompt
import shap
# Loading the Bank Churners dataset
df = pd.read_csv("../Dataset/BankChurners.csv")
# same random results every time
np.random.seed(1)
df.sample(n=10)
# To copy the data to another object
custData = df.copy()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CLIENTNUM                 10127 non-null  int64
 1   Attrition_Flag            10127 non-null  object
 2   Customer_Age              10127 non-null  int64
 3   Gender                    10127 non-null  object
 4   Dependent_count           10127 non-null  int64
 5   Education_Level           8608 non-null   object
 6   Marital_Status            9378 non-null   object
 7   Income_Category           10127 non-null  object
 8   Card_Category             10127 non-null  object
 9   Months_on_book            10127 non-null  int64
 10  Total_Relationship_Count  10127 non-null  int64
 11  Months_Inactive_12_mon    10127 non-null  int64
 12  Contacts_Count_12_mon     10127 non-null  int64
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64
 18  Total_Trans_Ct            10127 non-null  int64
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
# Printing the total number of rows and attributes in the dataset
print(
f"- There are {df.shape[0]} row samples and {df.shape[1]} attributes of the customer information collected in this dataset."
)
- There are 10127 row samples and 21 attributes of the customer information collected in this dataset.
df.head(5)  # Displaying the first 5 rows of the dataset
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
df.tail(5)  # Displaying the last 5 rows of the dataset
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10122 | 772366833 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | 3 | 2 | 3 | 4003.0 | 1851 | 2152.0 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | 710638233 | Attrited Customer | 41 | M | 2 | NaN | Divorced | $40K - $60K | Blue | 25 | 4 | 2 | 3 | 4277.0 | 2186 | 2091.0 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | 716506083 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | 5 | 3 | 4 | 5409.0 | 0 | 5409.0 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10125 | 717406983 | Attrited Customer | 30 | M | 2 | Graduate | NaN | $40K - $60K | Blue | 36 | 4 | 3 | 3 | 5281.0 | 0 | 5281.0 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
| 10126 | 714337233 | Attrited Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | 6 | 2 | 4 | 10388.0 | 1961 | 8427.0 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
df.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIENTNUM | 10127.0 | NaN | NaN | NaN | 739177606.333663 | 36903783.450231 | 708082083.0 | 713036770.5 | 717926358.0 | 773143533.0 | 828343083.0 |
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Customer_Age | 10127.0 | NaN | NaN | NaN | 46.32596 | 8.016814 | 26.0 | 41.0 | 46.0 | 52.0 | 73.0 |
| Gender | 10127 | 2 | F | 5358 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Dependent_count | 10127.0 | NaN | NaN | NaN | 2.346203 | 1.298908 | 0.0 | 1.0 | 2.0 | 3.0 | 5.0 |
| Education_Level | 8608 | 6 | Graduate | 3128 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Marital_Status | 9378 | 3 | Married | 4687 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Income_Category | 10127 | 6 | Less than $40K | 3561 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Card_Category | 10127 | 4 | Blue | 9436 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Months_on_book | 10127.0 | NaN | NaN | NaN | 35.928409 | 7.986416 | 13.0 | 31.0 | 36.0 | 40.0 | 56.0 |
| Total_Relationship_Count | 10127.0 | NaN | NaN | NaN | 3.81258 | 1.554408 | 1.0 | 3.0 | 4.0 | 5.0 | 6.0 |
| Months_Inactive_12_mon | 10127.0 | NaN | NaN | NaN | 2.341167 | 1.010622 | 0.0 | 2.0 | 2.0 | 3.0 | 6.0 |
| Contacts_Count_12_mon | 10127.0 | NaN | NaN | NaN | 2.455317 | 1.106225 | 0.0 | 2.0 | 2.0 | 3.0 | 6.0 |
| Credit_Limit | 10127.0 | NaN | NaN | NaN | 8631.953698 | 9088.77665 | 1438.3 | 2555.0 | 4549.0 | 11067.5 | 34516.0 |
| Total_Revolving_Bal | 10127.0 | NaN | NaN | NaN | 1162.814061 | 814.987335 | 0.0 | 359.0 | 1276.0 | 1784.0 | 2517.0 |
| Avg_Open_To_Buy | 10127.0 | NaN | NaN | NaN | 7469.139637 | 9090.685324 | 3.0 | 1324.5 | 3474.0 | 9859.0 | 34516.0 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | NaN | NaN | NaN | 0.759941 | 0.219207 | 0.0 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.0 | NaN | NaN | NaN | 4404.086304 | 3397.129254 | 510.0 | 2155.5 | 3899.0 | 4741.0 | 18484.0 |
| Total_Trans_Ct | 10127.0 | NaN | NaN | NaN | 64.858695 | 23.47257 | 10.0 | 45.0 | 67.0 | 81.0 | 139.0 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | NaN | NaN | NaN | 0.712222 | 0.238086 | 0.0 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.0 | NaN | NaN | NaN | 0.274894 | 0.275691 | 0.0 | 0.023 | 0.176 | 0.503 | 0.999 |
# creating histograms
df.hist(figsize=(14, 14))
plt.show()
Data Description:
- CLIENTNUM - There are 10127 customer samples provided in the dataset.
- Attrition_Flag - There are two unique values: "Existing Customer" and "Attrited Customer", with "Existing Customer" occurring most often. This will be the target variable for the model.
- Customer_Age - Customer ages range from 26 to 73 years, with a median age of 46. The data seems to be uniformly distributed based on the histogram.
- Gender - There are 2 unique values, with F occurring most often.
- Dependent_count - The dependent count varies from 0 to 5. This column can be treated as a category type.
- Education_Level - There are 6 unique values, with "Graduate" occurring most often. There are missing values which need to be treated. This column can be treated as a category type.
- Marital_Status - There are 3 unique values, with "Married" occurring most often. There are missing values which need to be treated. This column can be treated as a category type.
- Income_Category - There are 6 unique values, with "Less than $40K" occurring most often. This column can be treated as a category type.
- Card_Category - There are 4 unique values, with the "Blue" card occurring most often. This column can be treated as a category type.
- Months_on_book - On average, customers have been associated with the bank for 36 months, ranging between 13 and 56 months. The data seems to be uniformly distributed based on the histogram.
- Total_Relationship_Count - On average, customers hold 4 products. The mean and median are almost the same, with a minimum of 1 and a maximum of 6 products. This column can be treated as a category type.
- Months_Inactive_12_mon & Contacts_Count_12_mon - On average, 50% of the customers were inactive for almost 2 months, with a maximum inactivity of 6 months. These columns can be treated as a category type.
- Credit_Limit - Customers hold an average credit limit of 8632, while the median is 4549. The mean and median are far apart. The minimum credit limit is 1438 and the maximum is 34516. Need to check for outliers.
- Total_Revolving_Bal - On average, customers maintain a revolving balance of 1163, while the median is 1276. The maximum balance observed is 2517.
- Avg_Open_To_Buy - Customers have an average of 7469 left to spend on the card, while the median is 3474. The mean and median are far apart. The minimum is 3 and the maximum is 34516. Need to check for outliers.
- Total_Amt_Chng_Q4_Q1 - The ratio of the total transaction amount between Q4 and Q1 averages 0.76, ranging from 0 to 3.4.
- Total_Ct_Chng_Q4_Q1 - The ratio of the total transaction count between Q4 and Q1 averages 0.71, ranging from 0 to 3.7.
- Total_Trans_Amt - Total transaction amounts range between 510 and 18484, with an average of 4404 over the course of 12 months.
- Total_Trans_Ct - Total transaction counts range between 10 and 139, with an average of 64 transactions in 12 months.
- Avg_Utilization_Ratio - The average utilization ratio is around 0.27, with some customers not utilizing their credit at all (minimum ratio of 0) and others almost completely utilizing it (maximum of 0.99).
df.nunique()
CLIENTNUM 10127 Attrition_Flag 2 Customer_Age 45 Gender 2 Dependent_count 6 Education_Level 6 Marital_Status 3 Income_Category 6 Card_Category 4 Months_on_book 44 Total_Relationship_Count 6 Months_Inactive_12_mon 7 Contacts_Count_12_mon 7 Credit_Limit 6205 Total_Revolving_Bal 1974 Avg_Open_To_Buy 6813 Total_Amt_Chng_Q4_Q1 1158 Total_Trans_Amt 5033 Total_Trans_Ct 126 Total_Ct_Chng_Q4_Q1 830 Avg_Utilization_Ratio 964 dtype: int64
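The description above flags Credit_Limit and Avg_Open_To_Buy for an outlier check. One common heuristic (a minimal sketch, not necessarily the check applied later in this notebook) is the 1.5×IQR whisker rule, shown here on toy data; on the real frame it would be called as, e.g., `iqr_outlier_count(df["Credit_Limit"])`:

```python
import pandas as pd


def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside the [Q1 - k*IQR, Q3 + k*IQR] whisker range."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return int(((series < q1 - k * iqr) | (series > q3 + k * iqr)).sum())


# Toy illustration: only the extreme value 100 falls outside the whiskers
print(iqr_outlier_count(pd.Series([1, 2, 3, 4, 100])))  # -> 1
```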
Observations:
CLIENTNUM is unique for each customer and will not add value to the model.
# Dropping the 'ID' column since it's not required
df.drop(["CLIENTNUM"], axis=1, inplace=True)
# Checking for duplicated rows in the dataset
duplicateSum = df.duplicated().sum()
print("**Inferences:**")
if duplicateSum > 0:
print(f"- There are {str(duplicateSum)} duplicated row(s) in the dataset")
# Removing the duplicated rows in the dataset
df.drop_duplicates(inplace=True)
print(
f"- There are {str(df.duplicated().sum())} duplicated row(s) in the dataset post cleaning"
)
df.duplicated().sum()
# resetting the index of data frame since some rows will be removed
df.reset_index(drop=True, inplace=True)
else:
print("- There are no duplicated row(s) in the dataset")
**Inferences:** - There are no duplicated row(s) in the dataset
df.isnull().sum()
Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 1519 Marital_Status 749 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
Observations:
Education_Level & Marital_Status have missing values which need to be treated.
# printing the number of occurrences of each unique value in each categorical column
num_to_display = 10
for column in df.describe(include="all").columns:
val_counts = df[column].value_counts(
dropna=False
) # Kept dropNA to False to see the NA value count as well
print("Unique values in", column, "are :")
print(val_counts.iloc[:num_to_display])
if len(val_counts) > num_to_display:
print(f"Only displaying first {num_to_display} of {len(val_counts)} values.")
print("-" * 50)
print(" ")
Unique values in Attrition_Flag are : Existing Customer 8500 Attrited Customer 1627 Name: Attrition_Flag, dtype: int64 -------------------------------------------------- Unique values in Customer_Age are : 44 500 49 495 46 490 45 486 47 479 43 473 48 472 50 452 42 426 51 398 Name: Customer_Age, dtype: int64 Only displaying first 10 of 45 values. -------------------------------------------------- Unique values in Gender are : F 5358 M 4769 Name: Gender, dtype: int64 -------------------------------------------------- Unique values in Dependent_count are : 3 2732 2 2655 1 1838 4 1574 0 904 5 424 Name: Dependent_count, dtype: int64 -------------------------------------------------- Unique values in Education_Level are : Graduate 3128 High School 2013 NaN 1519 Uneducated 1487 College 1013 Post-Graduate 516 Doctorate 451 Name: Education_Level, dtype: int64 -------------------------------------------------- Unique values in Marital_Status are : Married 4687 Single 3943 NaN 749 Divorced 748 Name: Marital_Status, dtype: int64 -------------------------------------------------- Unique values in Income_Category are : Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 abc 1112 $120K + 727 Name: Income_Category, dtype: int64 -------------------------------------------------- Unique values in Card_Category are : Blue 9436 Silver 555 Gold 116 Platinum 20 Name: Card_Category, dtype: int64 -------------------------------------------------- Unique values in Months_on_book are : 36 2463 37 358 34 353 38 347 39 341 40 333 31 318 35 317 33 305 30 300 Name: Months_on_book, dtype: int64 Only displaying first 10 of 44 values. 
-------------------------------------------------- Unique values in Total_Relationship_Count are : 3 2305 4 1912 5 1891 6 1866 2 1243 1 910 Name: Total_Relationship_Count, dtype: int64 -------------------------------------------------- Unique values in Months_Inactive_12_mon are : 3 3846 2 3282 1 2233 4 435 5 178 6 124 0 29 Name: Months_Inactive_12_mon, dtype: int64 -------------------------------------------------- Unique values in Contacts_Count_12_mon are : 3 3380 2 3227 1 1499 4 1392 0 399 5 176 6 54 Name: Contacts_Count_12_mon, dtype: int64 -------------------------------------------------- Unique values in Credit_Limit are : 34516.0 508 1438.3 507 9959.0 18 15987.0 18 23981.0 12 2490.0 11 6224.0 11 3735.0 11 7469.0 10 2069.0 8 Name: Credit_Limit, dtype: int64 Only displaying first 10 of 6205 values. -------------------------------------------------- Unique values in Total_Revolving_Bal are : 0 2470 2517 508 1965 12 1480 12 1434 11 1664 11 1720 11 1590 10 1542 10 1528 10 Name: Total_Revolving_Bal, dtype: int64 Only displaying first 10 of 1974 values. -------------------------------------------------- Unique values in Avg_Open_To_Buy are : 1438.3 324 34516.0 98 31999.0 26 787.0 8 701.0 7 713.0 7 953.0 7 463.0 7 990.0 6 788.0 6 Name: Avg_Open_To_Buy, dtype: int64 Only displaying first 10 of 6813 values. -------------------------------------------------- Unique values in Total_Amt_Chng_Q4_Q1 are : 0.791 36 0.712 34 0.743 34 0.718 33 0.735 33 0.744 32 0.699 32 0.722 32 0.731 31 0.631 31 Name: Total_Amt_Chng_Q4_Q1, dtype: int64 Only displaying first 10 of 1158 values. -------------------------------------------------- Unique values in Total_Trans_Amt are : 4253 11 4509 11 4518 10 2229 10 4220 9 4869 9 4037 9 4313 9 4498 9 4042 9 Name: Total_Trans_Amt, dtype: int64 Only displaying first 10 of 5033 values. 
-------------------------------------------------- Unique values in Total_Trans_Ct are : 81 208 71 203 75 203 69 202 82 202 76 198 77 197 70 193 74 190 78 190 Name: Total_Trans_Ct, dtype: int64 Only displaying first 10 of 126 values. -------------------------------------------------- Unique values in Total_Ct_Chng_Q4_Q1 are : 0.667 171 1.000 166 0.500 161 0.750 156 0.600 113 0.800 101 0.714 92 0.833 85 0.778 69 0.625 63 Name: Total_Ct_Chng_Q4_Q1, dtype: int64 Only displaying first 10 of 830 values. -------------------------------------------------- Unique values in Avg_Utilization_Ratio are : 0.000 2470 0.073 44 0.057 33 0.048 32 0.060 30 0.061 29 0.045 29 0.059 28 0.069 28 0.053 27 Name: Avg_Utilization_Ratio, dtype: int64 Only displaying first 10 of 964 values. --------------------------------------------------
Observations:
- Attrition_Flag - There are 2 unique values: "Existing" & "Attrited" customers.
- Customer_Age - It is continuous data and can be kept as an int type.
- Gender - There are 2 unique values: 'M' & 'F'.
- Dependent_count - There are 6 unique values (0-5). The feature can be considered a category type.
- Education_Level - There are 6 unique values with a few NaNs. Missing values to be treated. The feature can be considered a category type.
- Marital_Status - There are 3 unique values: "Married", "Divorced" & "Single". Missing values to be treated. The feature can be considered a category type.
- Income_Category - There are 5 valid income ranges plus the value "abc", which is incorrect and needs to be treated as missing. The feature can be considered a category type.
- Card_Category - There are 4 unique values and this feature can be considered a category type.
- Months_on_book - It is continuous data and can be kept as an int type.
- Total_Relationship_Count - There are 6 definite values. It can be considered a category type.
- Months_Inactive_12_mon - There are 7 definite values. The 0 values could be missing values or possibly cards active in all months. It can be considered a category type.
- Contacts_Count_12_mon - There are 7 definite values. The 0 values could be missing values or possibly customers with no contacts. It can be considered a category type.
Inferences:
df["Attrition_Flag"] = df["Attrition_Flag"].astype("category")
df["Gender"] = df["Gender"].astype("category")
df["Dependent_count"] = df["Dependent_count"].astype("category")
df["Education_Level"] = df["Education_Level"].astype("category")
df["Marital_Status"] = df["Marital_Status"].astype("category")
df["Income_Category"] = df["Income_Category"].astype("category")
df["Card_Category"] = df["Card_Category"].astype("category")
df["Total_Relationship_Count"] = df["Total_Relationship_Count"].astype("category")
df["Months_Inactive_12_mon"] = df["Months_Inactive_12_mon"].astype("category")
df["Contacts_Count_12_mon"] = df["Contacts_Count_12_mon"].astype("category")
# The incorrect value "abc" in Income_Category is replaced with NaN and will be addressed as part of missing-value treatment
df.Income_Category = df.Income_Category.replace("abc", np.nan)
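The missing categorical values noted above (Education_Level, Marital_Status, and now Income_Category) could, for example, be filled with the most frequent category. A minimal sketch on toy data using the SimpleImputer already imported above (illustrative only, not necessarily the treatment chosen later in this notebook):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing education level (hypothetical values mirroring the real column)
toy = pd.DataFrame({"Education_Level": ["Graduate", np.nan, "High School", "Graduate"]})

# Most-frequent (mode) imputation for a categorical column
imputer = SimpleImputer(strategy="most_frequent")
toy["Education_Level"] = imputer.fit_transform(toy[["Education_Level"]]).ravel()
print(toy["Education_Level"].tolist())  # the NaN becomes "Graduate", the mode
```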
# Replacing the text values of the Target Variable with 0 (Existing) & 1 (Attrition)
att_flag = {"Existing Customer": 0, "Attrited Customer": 1}
df["Attrition_Flag"] = df["Attrition_Flag"].map(att_flag)
#df.Attrition_Flag = df.Attrition_Flag.replace("Existing Customer", 0)
#df.Attrition_Flag = df.Attrition_Flag.replace("Attrited Customer", 1)
# Defining bins for splitting the age to groups and creating a new column to review the relationship
bins = [20, 30, 40, 50, 60, 70, 80]
labels = [
"Less_than_30",
"Less_than_40",
"Less_than_50",
"Less_than_60",
"Less_than_70",
"Less_than_80",
]
df["AgeGroup"] = pd.cut(df["Customer_Age"], bins=bins, labels=labels, right=False)
df["AgeGroup"] = df["AgeGroup"].astype("category")
df["AgeGroup"].value_counts(dropna=False)
Less_than_50 4561 Less_than_60 2998 Less_than_40 1841 Less_than_70 530 Less_than_30 195 Less_than_80 2 Name: AgeGroup, dtype: int64
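Because the binning above uses `right=False`, each bin is left-closed: a boundary age such as exactly 30 falls into "Less_than_40", not "Less_than_30". A small standalone illustration with the same bins and labels:

```python
import pandas as pd

bins = [20, 30, 40, 50, 60, 70, 80]
labels = ["Less_than_30", "Less_than_40", "Less_than_50",
          "Less_than_60", "Less_than_70", "Less_than_80"]

ages = pd.Series([26, 30, 49, 73])
# right=False makes bins left-closed, so 30 lands in [30, 40) -> "Less_than_40"
groups = pd.cut(ages, bins=bins, labels=labels, right=False)
print(groups.tolist())
```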
# Defining bins for splitting the relationship years to groups and creating a new column to review the relationship
bins = [0, 12, 24, 36, 48, 60, 72]
labels = [
"Between_0-1_Year",
"Between_1-2_Year",
"Between_2-3_Year",
"Between_3-4_Year",
"Between_4-5_Year",
"Between_5-6_Year",
]
df["Months_on_book_Grp"] = pd.cut(
df["Months_on_book"], bins=bins, labels=labels, right=False
)
df["Months_on_book_Grp"] = df["Months_on_book_Grp"].astype("category")
df["Months_on_book_Grp"].value_counts(dropna=False)
Between_3-4_Year 5508 Between_2-3_Year 3115 Between_4-5_Year 817 Between_1-2_Year 687 Between_0-1_Year 0 Between_5-6_Year 0 Name: Months_on_book_Grp, dtype: int64
# Defining bins for splitting the credit limits to groups and creating a new column to review the relationship
bins = [0, 5000, 10000, 15000, 20000, 25000, 40000]
labels = [
"<5K",
"Between_5K-10K",
"Between_10K-15K",
"Between_15K-20K",
"Between_20K-25K",
">25K",
]
df["Credit_Limit_Grp"] = pd.cut(
df["Credit_Limit"], bins=bins, labels=labels, right=False
)
df["Credit_Limit_Grp"] = df["Credit_Limit_Grp"].astype("category")
df["Credit_Limit_Grp"].value_counts(dropna=False)
<5K 5358 Between_5K-10K 2015 Between_10K-15K 941 >25K 892 Between_15K-20K 549 Between_20K-25K 372 Name: Credit_Limit_Grp, dtype: int64
# Reviewing the data types after the changes
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Attrition_Flag            10127 non-null  category
 1   Customer_Age              10127 non-null  int64
 2   Gender                    10127 non-null  category
 3   Dependent_count           10127 non-null  category
 4   Education_Level           8608 non-null   category
 5   Marital_Status            9378 non-null   category
 6   Income_Category           9015 non-null   category
 7   Card_Category             10127 non-null  category
 8   Months_on_book            10127 non-null  int64
 9   Total_Relationship_Count  10127 non-null  category
 10  Months_Inactive_12_mon    10127 non-null  category
 11  Contacts_Count_12_mon     10127 non-null  category
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64
 17  Total_Trans_Ct            10127 non-null  int64
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
 20  AgeGroup                  10127 non-null  category
 21  Months_on_book_Grp        10127 non-null  category
 22  Credit_Limit_Grp          10127 non-null  category
dtypes: category(13), float64(5), int64(5)
memory usage: 922.6 KB
# Printing the total number of rows and attributes in the dataset
print(
f"- There are {df.shape[0]} row samples and {df.shape[1]} attributes of the customer information collected in this dataset."
)
- There are 10127 row samples and 23 attributes of the customer information collected in this dataset.
# Identifying the category columns
category_columnNames = df.describe(include=["category"]).columns
category_columnNames
Index(['Attrition_Flag', 'Gender', 'Dependent_count', 'Education_Level',
'Marital_Status', 'Income_Category', 'Card_Category',
'Total_Relationship_Count', 'Months_Inactive_12_mon',
'Contacts_Count_12_mon', 'AgeGroup', 'Months_on_book_Grp',
'Credit_Limit_Grp'],
dtype='object')
# Identifying the numerical columns
number_columnNames = (
df.describe(include=["int64"]).columns.tolist()
+ df.describe(include=["float64"]).columns.tolist()
)
number_columnNames
['Customer_Age', 'Months_on_book', 'Total_Revolving_Bal', 'Total_Trans_Amt', 'Total_Trans_Ct', 'Credit_Limit', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']
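An equivalent, arguably more idiomatic way to split columns by dtype is `DataFrame.select_dtypes`, sketched here on a toy frame with hypothetical column names (on the real data it would reproduce `number_columnNames` and `category_columnNames`):

```python
import pandas as pd

# Toy frame with the three dtypes used in this notebook
toy = pd.DataFrame({
    "age": [26, 49],                       # int64
    "ratio": [0.06, 0.10],                 # float64
    "gender": pd.Categorical(["M", "F"]),  # category
})
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(include="category").columns.tolist()
print(num_cols, cat_cols)  # -> ['age', 'ratio'] ['gender']
```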
df.describe(include="category").T
| count | unique | top | freq | |
|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | 0 | 8500 |
| Gender | 10127 | 2 | F | 5358 |
| Dependent_count | 10127 | 6 | 3 | 2732 |
| Education_Level | 8608 | 6 | Graduate | 3128 |
| Marital_Status | 9378 | 3 | Married | 4687 |
| Income_Category | 9015 | 5 | Less than $40K | 3561 |
| Card_Category | 10127 | 4 | Blue | 9436 |
| Total_Relationship_Count | 10127 | 6 | 3 | 2305 |
| Months_Inactive_12_mon | 10127 | 7 | 3 | 3846 |
| Contacts_Count_12_mon | 10127 | 7 | 3 | 3380 |
| AgeGroup | 10127 | 6 | Less_than_50 | 4561 |
| Months_on_book_Grp | 10127 | 4 | Between_3-4_Year | 5508 |
| Credit_Limit_Grp | 10127 | 6 | <5K | 5358 |
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Customer_Age | 10127.0 | 46.325960 | 8.016814 | 26.0 | 41.000 | 46.000 | 52.000 | 73.000 |
| Months_on_book | 10127.0 | 35.928409 | 7.986416 | 13.0 | 31.000 | 36.000 | 40.000 | 56.000 |
| Credit_Limit | 10127.0 | 8631.953698 | 9088.776650 | 1438.3 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.0 | 1162.814061 | 814.987335 | 0.0 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.0 | 7469.139637 | 9090.685324 | 3.0 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | 0.759941 | 0.219207 | 0.0 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.0 | 4404.086304 | 3397.129254 | 510.0 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.0 | 64.858695 | 23.472570 | 10.0 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | 0.712222 | 0.238086 | 0.0 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.0 | 0.274894 | 0.275691 | 0.0 | 0.023 | 0.176 | 0.503 | 0.999 |
Data Structure:
Data Cleaning:
The Client Number attribute is not required for the analysis, so the column was dropped. The Education_Level, Marital_Status & Income_Category features have missing values in the dataset, which will be addressed during modeling.
Data Description:
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None, hueCol=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
hueCol: optional column name used to split the bars by hue (default is None)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 7))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
hue=hueCol,
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
y = p.get_height()  # bar height
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
)
# annotate the percentage
plt.show() # show the plot
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True,).sort_values(
by=sorter, ascending=False
)
print("-" * 30, " Volume ", "-" * 30)
print(tab1)
tab1 = pd.crosstab(
data[predictor], data[target], margins=True, normalize="index"
).sort_values(by=sorter, ascending=False)
print("-" * 30, " Percentage % ", "-" * 30)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
# Creating a common function to draw a Boxplot & a Histogram for each of the analysis
def histogram_boxplot(data, feature, figsize=(15, 7), kde=True, bins=None):
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
if bins:
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
)  # palette is ignored by histplot without hue, so it is omitted
else:
sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# functions to treat outliers by flooring and capping
def treat_outliers(df, col, lower=0.25, upper=0.75, mul=1.5):
"""
Treats outliers in a variable
df: dataframe
col: dataframe column
"""
Q1 = df[col].quantile(lower) # 25th quantile
Q3 = df[col].quantile(upper) # 75th quantile
IQR = Q3 - Q1
Lower_Whisker = Q1 - (mul * IQR)
Upper_Whisker = Q3 + (mul * IQR)
# all the values smaller than Lower_Whisker will be assigned the value of Lower_Whisker
# all the values greater than Upper_Whisker will be assigned the value of Upper_Whisker
df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
return df
def treat_outliers_all(df, col_list, lower=0.25, upper=0.75, mul=1.5):
"""
Treat outliers in a list of variables
df: dataframe
col_list: list of dataframe columns
"""
for c in col_list:
df = treat_outliers(df, c, lower, upper, mul)
return df
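As a quick sanity check, the capping behaviour can be exercised on a synthetic column (toy values, not from the dataset); the single extreme value is pulled back to the upper whisker:

```python
import numpy as np
import pandas as pd

def treat_outliers(df, col, lower=0.25, upper=0.75, mul=1.5):
    # Same flooring/capping logic as above: clip to Q1 - mul*IQR and Q3 + mul*IQR
    Q1 = df[col].quantile(lower)
    Q3 = df[col].quantile(upper)
    IQR = Q3 - Q1
    df[col] = np.clip(df[col], Q1 - mul * IQR, Q3 + mul * IQR)
    return df

# Synthetic column: six typical values and one extreme outlier
demo = pd.DataFrame({"x": [10, 12, 11, 13, 12, 11, 100]})
demo = treat_outliers(demo, "x")

# Q1 = 11, Q3 = 12.5, IQR = 1.5, so the upper whisker is 12.5 + 1.5*1.5 = 14.75
print(demo["x"].max())  # 14.75
```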
# Summary of data
df.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Attrition_Flag | 10127.0 | 2.0 | 0.0 | 8500.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Customer_Age | 10127.0 | NaN | NaN | NaN | 46.32596 | 8.016814 | 26.0 | 41.0 | 46.0 | 52.0 | 73.0 |
| Gender | 10127 | 2 | F | 5358 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Dependent_count | 10127.0 | 6.0 | 3.0 | 2732.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Education_Level | 8608 | 6 | Graduate | 3128 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Marital_Status | 9378 | 3 | Married | 4687 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Income_Category | 9015 | 5 | Less than $40K | 3561 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Card_Category | 10127 | 4 | Blue | 9436 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Months_on_book | 10127.0 | NaN | NaN | NaN | 35.928409 | 7.986416 | 13.0 | 31.0 | 36.0 | 40.0 | 56.0 |
| Total_Relationship_Count | 10127.0 | 6.0 | 3.0 | 2305.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Months_Inactive_12_mon | 10127.0 | 7.0 | 3.0 | 3846.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Contacts_Count_12_mon | 10127.0 | 7.0 | 3.0 | 3380.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Credit_Limit | 10127.0 | NaN | NaN | NaN | 8631.953698 | 9088.77665 | 1438.3 | 2555.0 | 4549.0 | 11067.5 | 34516.0 |
| Total_Revolving_Bal | 10127.0 | NaN | NaN | NaN | 1162.814061 | 814.987335 | 0.0 | 359.0 | 1276.0 | 1784.0 | 2517.0 |
| Avg_Open_To_Buy | 10127.0 | NaN | NaN | NaN | 7469.139637 | 9090.685324 | 3.0 | 1324.5 | 3474.0 | 9859.0 | 34516.0 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | NaN | NaN | NaN | 0.759941 | 0.219207 | 0.0 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.0 | NaN | NaN | NaN | 4404.086304 | 3397.129254 | 510.0 | 2155.5 | 3899.0 | 4741.0 | 18484.0 |
| Total_Trans_Ct | 10127.0 | NaN | NaN | NaN | 64.858695 | 23.47257 | 10.0 | 45.0 | 67.0 | 81.0 | 139.0 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | NaN | NaN | NaN | 0.712222 | 0.238086 | 0.0 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.0 | NaN | NaN | NaN | 0.274894 | 0.275691 | 0.0 | 0.023 | 0.176 | 0.503 | 0.999 |
| AgeGroup | 10127 | 6 | Less_than_50 | 4561 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Months_on_book_Grp | 10127 | 4 | Between_3-4_Year | 5508 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Credit_Limit_Grp | 10127 | 6 | <5K | 5358 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# printing the number of occurrences of each unique value in each categorical column
num_to_display = 15
for column in category_columnNames:
val_counts = df[column].value_counts(
dropna=False
) # Kept dropNA to False to see the NA value count as well
#val_countsP = df[column].value_counts(dropna=False, normalize=True)
print("Unique values in", column, "are :")
print(val_counts.iloc[:num_to_display])
#print(val_countsP.iloc[:num_to_display])
if len(val_counts) > num_to_display:
print(f"Only displaying first {num_to_display} of {len(val_counts)} values.")
labeled_barplot(df, column, perc=True, n=5)  # tight_layout after show() would only create empty figures
print("-" * 50)
print(" ")
Unique values in Attrition_Flag are : 0 8500 1 1627 Name: Attrition_Flag, dtype: int64
-------------------------------------------------- Unique values in Gender are : F 5358 M 4769 Name: Gender, dtype: int64
-------------------------------------------------- Unique values in Dependent_count are : 3 2732 2 2655 1 1838 4 1574 0 904 5 424 Name: Dependent_count, dtype: int64
-------------------------------------------------- Unique values in Education_Level are : Graduate 3128 High School 2013 NaN 1519 Uneducated 1487 College 1013 Post-Graduate 516 Doctorate 451 Name: Education_Level, dtype: int64
-------------------------------------------------- Unique values in Marital_Status are : Married 4687 Single 3943 NaN 749 Divorced 748 Name: Marital_Status, dtype: int64
-------------------------------------------------- Unique values in Income_Category are : Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 NaN 1112 $120K + 727 Name: Income_Category, dtype: int64
-------------------------------------------------- Unique values in Card_Category are : Blue 9436 Silver 555 Gold 116 Platinum 20 Name: Card_Category, dtype: int64
-------------------------------------------------- Unique values in Total_Relationship_Count are : 3 2305 4 1912 5 1891 6 1866 2 1243 1 910 Name: Total_Relationship_Count, dtype: int64
-------------------------------------------------- Unique values in Months_Inactive_12_mon are : 3 3846 2 3282 1 2233 4 435 5 178 6 124 0 29 Name: Months_Inactive_12_mon, dtype: int64
-------------------------------------------------- Unique values in Contacts_Count_12_mon are : 3 3380 2 3227 1 1499 4 1392 0 399 5 176 6 54 Name: Contacts_Count_12_mon, dtype: int64
-------------------------------------------------- Unique values in AgeGroup are : Less_than_50 4561 Less_than_60 2998 Less_than_40 1841 Less_than_70 530 Less_than_30 195 Less_than_80 2 Name: AgeGroup, dtype: int64
-------------------------------------------------- Unique values in Months_on_book_Grp are : Between_3-4_Year 5508 Between_2-3_Year 3115 Between_4-5_Year 817 Between_1-2_Year 687 Between_0-1_Year 0 Between_5-6_Year 0 Name: Months_on_book_Grp, dtype: int64
-------------------------------------------------- Unique values in Credit_Limit_Grp are : <5K 5358 Between_5K-10K 2015 Between_10K-15K 941 >25K 892 Between_15K-20K 549 Between_20K-25K 372 Name: Credit_Limit_Grp, dtype: int64
--------------------------------------------------
Observations:
# creating histograms
df.hist(figsize=(14, 14))
plt.show()
# Summary of numeric data
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Customer_Age | 10127.0 | 46.325960 | 8.016814 | 26.0 | 41.000 | 46.000 | 52.000 | 73.000 |
| Months_on_book | 10127.0 | 35.928409 | 7.986416 | 13.0 | 31.000 | 36.000 | 40.000 | 56.000 |
| Credit_Limit | 10127.0 | 8631.953698 | 9088.776650 | 1438.3 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.0 | 1162.814061 | 814.987335 | 0.0 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.0 | 7469.139637 | 9090.685324 | 3.0 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | 0.759941 | 0.219207 | 0.0 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.0 | 4404.086304 | 3397.129254 | 510.0 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.0 | 64.858695 | 23.472570 | 10.0 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | 0.712222 | 0.238086 | 0.0 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.0 | 0.274894 | 0.275691 | 0.0 | 0.023 | 0.176 | 0.503 | 0.999 |
Observations:
histogram_boxplot(df, "Customer_Age")
Observations:
histogram_boxplot(df, "Months_on_book")
Observations:
df[df.Months_on_book > 50]["Months_on_book"].describe()
count    418.000000
mean      53.535885
std        1.838805
min       51.000000
25%       52.000000
50%       53.000000
75%       55.000000
max       56.000000
Name: Months_on_book, dtype: float64
histogram_boxplot(df, "Credit_Limit")
df[(df['Credit_Limit'] > 20000)]["Income_Category"].value_counts()
$80K - $120K 520 $120K + 344 $60K - $80K 233 $40K - $60K 27 Less than $40K 0 Name: Income_Category, dtype: int64
# Finding the median values of the Credit Limit with respect to the Income category
df.groupby(["Income_Category"])[["Credit_Limit"]].median()
| Credit_Limit | |
|---|---|
| Income_Category | |
| $120K + | 18442.0 |
| $40K - $60K | 3682.0 |
| $60K - $80K | 7660.0 |
| $80K - $120K | 12830.0 |
| Less than $40K | 2766.0 |
# Finding the median values of the Credit Limit with respect to the Card category
df.groupby(["Card_Category"])[["Credit_Limit"]].median()
| Credit_Limit | |
|---|---|
| Card_Category | |
| Blue | 4105.0 |
| Gold | 34516.0 |
| Platinum | 34516.0 |
| Silver | 29808.0 |
df.groupby(["Income_Category", "Card_Category"])[["Credit_Limit"]].median()
| Credit_Limit | ||
|---|---|---|
| Income_Category | Card_Category | |
| $120K + | Blue | 15769.0 |
| Gold | 34516.0 | |
| Platinum | 34516.0 | |
| Silver | 34516.0 | |
| $40K - $60K | Blue | 3454.0 |
| Gold | 23981.0 | |
| Platinum | 23981.0 | |
| Silver | 17304.0 | |
| $60K - $80K | Blue | 6784.0 |
| Gold | 34516.0 | |
| Platinum | 34516.0 | |
| Silver | 29810.0 | |
| $80K - $120K | Blue | 11617.0 |
| Gold | 34516.0 | |
| Platinum | 34516.0 | |
| Silver | 34516.0 | |
| Less than $40K | Blue | 2705.0 |
| Gold | 15987.0 | |
| Platinum | 15987.0 | |
| Silver | 12319.5 |
df["Credit_Limit"] = np.where(
((df["Credit_Limit"] > 20000) & (df["Income_Category"] == "$40K - $60K")),
3682,
df["Credit_Limit"],
)
df["Credit_Limit"] = np.where(
((df["Credit_Limit"] > 20000) & (df["Income_Category"] == "$60K - $80K")),
7660,
df["Credit_Limit"],
)
df["Credit_Limit"] = np.where(
((df["Credit_Limit"] > 20000) & (df["Income_Category"] == "$80K - $120K")),
12830,
df["Credit_Limit"],
)
df["Credit_Limit"] = np.where(
((df["Credit_Limit"] > 20000) & (df["Income_Category"] == "$120K +")),
18442,
df["Credit_Limit"],
)
df["Credit_Limit"] = np.where(
((df["Credit_Limit"] > 20000) & (df["Income_Category"].isna())),
df["Credit_Limit"].mean(),
df["Credit_Limit"],
)
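The five chained `np.where` calls can be collapsed into one masked assignment that maps each income band to its median (a sketch on toy rows with hypothetical values; the medians are hard-coded from the groupby table above, and missing income falls back to the column mean, as in the notebook):

```python
import pandas as pd

# Toy rows (hypothetical values) mirroring Income_Category / Credit_Limit
toy = pd.DataFrame(
    {
        "Income_Category": ["$40K - $60K", "$40K - $60K", "$120K +", None],
        "Credit_Limit": [3000.0, 25000.0, 34516.0, 30000.0],
    }
)

# Per-income medians taken from the groupby table above
medians = {
    "$40K - $60K": 3682,
    "$60K - $80K": 7660,
    "$80K - $120K": 12830,
    "$120K +": 18442,
}

mask = toy["Credit_Limit"] > 20000
fallback = toy["Credit_Limit"].mean()  # used when Income_Category is missing
toy.loc[mask, "Credit_Limit"] = (
    toy.loc[mask, "Income_Category"].map(medians).fillna(fallback)
)

print(toy["Credit_Limit"].tolist())  # [3000.0, 3682.0, 18442.0, 23129.0]
```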
df[(df["Credit_Limit"] > 20000)].head()
(empty DataFrame: after the replacements, no rows with Credit_Limit > 20000 remain)
histogram_boxplot(df, "Credit_Limit")
histogram_boxplot(df, "Total_Revolving_Bal")
Observations:
histogram_boxplot(df, "Avg_Open_To_Buy")
Observations:
df[(df["Avg_Open_To_Buy"] > 20000)]["Income_Category"].value_counts()
$80K - $120K 486 $120K + 328 $60K - $80K 205 $40K - $60K 22 Less than $40K 0 Name: Income_Category, dtype: int64
# Finding the median values of the Avg_Open_To_Buy with respect to the Income category
df.groupby(["Income_Category"])[["Avg_Open_To_Buy"]].median()
| Avg_Open_To_Buy | |
|---|---|
| Income_Category | |
| $120K + | 17117.0 |
| $40K - $60K | 2580.5 |
| $60K - $80K | 6418.5 |
| $80K - $120K | 11606.0 |
| Less than $40K | 1478.0 |
df["Avg_Open_To_Buy"] = np.where(
((df["Avg_Open_To_Buy"] > 20000) & (df["Income_Category"] == "$40K - $60K")),
2580.5,
df["Avg_Open_To_Buy"],
)
df["Avg_Open_To_Buy"] = np.where(
((df["Avg_Open_To_Buy"] > 20000) & (df["Income_Category"] == "$60K - $80K")),
6418.5,
df["Avg_Open_To_Buy"],
)
df["Avg_Open_To_Buy"] = np.where(
((df["Avg_Open_To_Buy"] > 20000) & (df["Income_Category"] == "$80K - $120K")),
11606,
df["Avg_Open_To_Buy"],
)
df["Avg_Open_To_Buy"] = np.where(
((df["Avg_Open_To_Buy"] > 20000) & (df["Income_Category"] == "$120K +")),
17117,
df["Avg_Open_To_Buy"],
)
df["Avg_Open_To_Buy"] = np.where(
((df["Avg_Open_To_Buy"] > 20000) & (df["Income_Category"].isna())),
df["Avg_Open_To_Buy"].mean(),
df["Avg_Open_To_Buy"],
)
histogram_boxplot(df, "Avg_Open_To_Buy")
Observations:
histogram_boxplot(df, "Total_Amt_Chng_Q4_Q1")
Observations:
df = treat_outliers(df, "Total_Amt_Chng_Q4_Q1", 0.25, 0.75, 1.5)
histogram_boxplot(df, "Total_Amt_Chng_Q4_Q1")
histogram_boxplot(df, "Total_Ct_Chng_Q4_Q1")
Observations:
df = treat_outliers(df, "Total_Ct_Chng_Q4_Q1", 0.25, 0.75, 1.5)
histogram_boxplot(df, "Total_Ct_Chng_Q4_Q1")
histogram_boxplot(df, "Total_Trans_Amt")
df.groupby(["Income_Category"])[["Credit_Limit"]].median()
| Credit_Limit | |
|---|---|
| Income_Category | |
| $120K + | 18442.0 |
| $40K - $60K | 3682.0 |
| $60K - $80K | 7656.5 |
| $80K - $120K | 12830.0 |
| Less than $40K | 2766.0 |
df.groupby(["Card_Category"])[["Credit_Limit"]].median()
| Credit_Limit | |
|---|---|
| Card_Category | |
| Blue | 4105.0 |
| Gold | 12830.0 |
| Platinum | 10245.0 |
| Silver | 12830.0 |
df.groupby(["Card_Category", "Income_Category"])[["Credit_Limit"]].median()
| Credit_Limit | ||
|---|---|---|
| Card_Category | Income_Category | |
| Blue | $120K + | 15769.0 |
| $40K - $60K | 3454.0 | |
| $60K - $80K | 6784.0 | |
| $80K - $120K | 11617.0 | |
| Less than $40K | 2705.0 | |
| Gold | $120K + | 18442.0 |
| $40K - $60K | 3682.0 | |
| $60K - $80K | 7660.0 | |
| $80K - $120K | 12830.0 | |
| Less than $40K | 15987.0 | |
| Platinum | $120K + | 18442.0 |
| $40K - $60K | 3682.0 | |
| $60K - $80K | 7660.0 | |
| $80K - $120K | 12830.0 | |
| Less than $40K | 15987.0 | |
| Silver | $120K + | 18442.0 |
| $40K - $60K | 16406.0 | |
| $60K - $80K | 7660.0 | |
| $80K - $120K | 12830.0 | |
| Less than $40K | 12319.5 |
df.groupby(["Card_Category", "Income_Category"])[["Total_Trans_Amt"]].median()
| Total_Trans_Amt | ||
|---|---|---|
| Card_Category | Income_Category | |
| Blue | $120K + | 3453.0 |
| $40K - $60K | 3918.0 | |
| $60K - $80K | 3444.0 | |
| $80K - $120K | 3448.0 | |
| Less than $40K | 4084.0 | |
| Gold | $120K + | 7897.5 |
| $40K - $60K | 13847.0 | |
| $60K - $80K | 7582.0 | |
| $80K - $120K | 5547.0 | |
| Less than $40K | 7370.0 | |
| Platinum | $120K + | 10896.5 |
| $40K - $60K | 4758.0 | |
| $60K - $80K | 11427.0 | |
| $80K - $120K | 7504.5 | |
| Less than $40K | 8059.5 | |
| Silver | $120K + | 4485.5 |
| $40K - $60K | 4232.0 | |
| $60K - $80K | 4055.0 | |
| $80K - $120K | 4544.0 | |
| Less than $40K | 4699.0 |
Observations:
histogram_boxplot(df, "Total_Trans_Ct")
Observations:
histogram_boxplot(df, "Avg_Utilization_Ratio")
Observations:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Attrition_Flag            10127 non-null  category
 1   Customer_Age              10127 non-null  int64
 2   Gender                    10127 non-null  category
 3   Dependent_count           10127 non-null  category
 4   Education_Level           8608 non-null   category
 5   Marital_Status            9378 non-null   category
 6   Income_Category           9015 non-null   category
 7   Card_Category             10127 non-null  category
 8   Months_on_book            10127 non-null  int64
 9   Total_Relationship_Count  10127 non-null  category
 10  Months_Inactive_12_mon    10127 non-null  category
 11  Contacts_Count_12_mon     10127 non-null  category
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64
 17  Total_Trans_Ct            10127 non-null  int64
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
 20  AgeGroup                  10127 non-null  category
 21  Months_on_book_Grp        10127 non-null  category
 22  Credit_Limit_Grp          10127 non-null  category
dtypes: category(13), float64(5), int64(5)
memory usage: 922.6 KB
# Data description of categorical variables
df.describe(include="category").T
| count | unique | top | freq | |
|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | 0 | 8500 |
| Gender | 10127 | 2 | F | 5358 |
| Dependent_count | 10127 | 6 | 3 | 2732 |
| Education_Level | 8608 | 6 | Graduate | 3128 |
| Marital_Status | 9378 | 3 | Married | 4687 |
| Income_Category | 9015 | 5 | Less than $40K | 3561 |
| Card_Category | 10127 | 4 | Blue | 9436 |
| Total_Relationship_Count | 10127 | 6 | 3 | 2305 |
| Months_Inactive_12_mon | 10127 | 7 | 3 | 3846 |
| Contacts_Count_12_mon | 10127 | 7 | 3 | 3380 |
| AgeGroup | 10127 | 6 | Less_than_50 | 4561 |
| Months_on_book_Grp | 10127 | 4 | Between_3-4_Year | 5508 |
| Credit_Limit_Grp | 10127 | 6 | <5K | 5358 |
# Data description of numerical variables
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Customer_Age | 10127.0 | 46.325960 | 8.016814 | 26.000 | 41.000 | 46.000 | 52.000 | 73.000 |
| Months_on_book | 10127.0 | 35.928409 | 7.986416 | 13.000 | 31.000 | 36.000 | 40.000 | 56.000 |
| Credit_Limit | 10127.0 | 6528.924665 | 5023.257606 | 1438.300 | 2555.000 | 4507.000 | 9435.000 | 19999.000 |
| Total_Revolving_Bal | 10127.0 | 1162.814061 | 814.987335 | 0.000 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.0 | 5453.166224 | 5172.994080 | 3.000 | 1324.500 | 3440.000 | 8436.000 | 19995.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | 0.751387 | 0.184542 | 0.289 | 0.631 | 0.736 | 0.859 | 1.201 |
| Total_Trans_Amt | 10127.0 | 4404.086304 | 3397.129254 | 510.000 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.0 | 64.858695 | 23.472570 | 10.000 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | 0.703484 | 0.197203 | 0.228 | 0.582 | 0.702 | 0.818 | 1.172 |
| Avg_Utilization_Ratio | 10127.0 | 0.274894 | 0.275691 | 0.000 | 0.023 | 0.176 | 0.503 | 0.999 |
# Looping through the categorical columns to print attrition counts & percentages and plot labeled barplots
for i, cols in zip(range(len(category_columnNames)), category_columnNames):
count = df[cols].nunique()
sorter = df["Attrition_Flag"].value_counts(dropna=False).index[-1]
tab1 = pd.crosstab(df[cols], df["Attrition_Flag"], margins=True,).sort_values(
by=sorter, ascending=False
)
print("-" * 30, " Volume ", "-" * 30)
print(tab1)
tab1 = pd.crosstab(
df[cols], df["Attrition_Flag"], margins=True, normalize="index"
).sort_values(by=sorter, ascending=False)
print("-" * 30, " Percentage % ", "-" * 30)
print(tab1)
print("-" * 120)
labeled_barplot(df, cols, perc=True, n=10, hueCol="Attrition_Flag")  # tight_layout after show() would only create empty figures
------------------------------ Volume ------------------------------ Attrition_Flag 1 0 All Attrition_Flag 1 1627 0 1627 All 1627 8500 10127 0 0 8500 8500 ------------------------------ Percentage % ------------------------------ Attrition_Flag 1 0 Attrition_Flag 1 1.00000 0.00000 All 0.16066 0.83934 0 0.00000 1.00000 ------------------------------------------------------------------------------------------------------------------------
------------------------------ Volume ------------------------------ Attrition_Flag 1 0 All Gender All 1627 8500 10127 F 930 4428 5358 M 697 4072 4769 ------------------------------ Percentage % ------------------------------ Attrition_Flag 1 0 Gender F 0.173572 0.826428 All 0.160660 0.839340 M 0.146152 0.853848 ------------------------------------------------------------------------------------------------------------------------
------------------------------ Volume ------------------------------ Attrition_Flag 1 0 All Dependent_count All 1627 8500 10127 3 482 2250 2732 2 417 2238 2655 1 269 1569 1838 4 260 1314 1574 0 135 769 904 5 64 360 424 ------------------------------ Percentage % ------------------------------ Attrition_Flag 1 0 Dependent_count 3 0.176428 0.823572 4 0.165184 0.834816 All 0.160660 0.839340 2 0.157062 0.842938 5 0.150943 0.849057 0 0.149336 0.850664 1 0.146355 0.853645 ------------------------------------------------------------------------------------------------------------------------
------------------------------ Volume ------------------------------ Attrition_Flag 1 0 All Education_Level All 1371 7237 8608 Graduate 487 2641 3128 High School 306 1707 2013 Uneducated 237 1250 1487 College 154 859 1013 Doctorate 95 356 451 Post-Graduate 92 424 516 ------------------------------ Percentage % ------------------------------ Attrition_Flag 1 0 Education_Level Doctorate 0.210643 0.789357 Post-Graduate 0.178295 0.821705 Uneducated 0.159381 0.840619 All 0.159270 0.840730 Graduate 0.155691 0.844309 College 0.152024 0.847976 High School 0.152012 0.847988 ------------------------------------------------------------------------------------------------------------------------
------------------------------ Volume ------------------------------ Attrition_Flag 1 0 All Marital_Status All 1498 7880 9378 Married 709 3978 4687 Single 668 3275 3943 Divorced 121 627 748 ------------------------------ Percentage % ------------------------------ Attrition_Flag 1 0 Marital_Status Single 0.169414 0.830586 Divorced 0.161765 0.838235 All 0.159736 0.840264 Married 0.151269 0.848731 ------------------------------------------------------------------------------------------------------------------------
------------------------------ Volume ------------------------------ Attrition_Flag 1 0 All Income_Category All 1440 7575 9015 Less than $40K 612 2949 3561 $40K - $60K 271 1519 1790 $80K - $120K 242 1293 1535 $60K - $80K 189 1213 1402 $120K + 126 601 727 ------------------------------ Percentage % ------------------------------ Attrition_Flag 1 0 Income_Category $120K + 0.173315 0.826685 Less than $40K 0.171862 0.828138 All 0.159734 0.840266 $80K - $120K 0.157655 0.842345 $40K - $60K 0.151397 0.848603 $60K - $80K 0.134807 0.865193 ------------------------------------------------------------------------------------------------------------------------
------------------------------ Volume ------------------------------ Attrition_Flag 1 0 All Card_Category All 1627 8500 10127 Blue 1519 7917 9436 Silver 82 473 555 Gold 21 95 116 Platinum 5 15 20 ------------------------------ Percentage % ------------------------------ Attrition_Flag 1 0 Card_Category Platinum 0.250000 0.750000 Gold 0.181034 0.818966 Blue 0.160979 0.839021 All 0.160660 0.839340 Silver 0.147748 0.852252 ------------------------------------------------------------------------------------------------------------------------
------------------------------ Volume ------------------------------ Attrition_Flag 1 0 All Total_Relationship_Count All 1627 8500 10127 3 400 1905 2305 2 346 897 1243 1 233 677 910 5 227 1664 1891 4 225 1687 1912 6 196 1670 1866 ------------------------------ Percentage % ------------------------------ Attrition_Flag 1 0 Total_Relationship_Count 2 0.278359 0.721641 1 0.256044 0.743956 3 0.173536 0.826464 All 0.160660 0.839340 5 0.120042 0.879958 4 0.117678 0.882322 6 0.105038 0.894962 ------------------------------------------------------------------------------------------------------------------------
------------------------------ Volume ------------------------------ Attrition_Flag 1 0 All Months_Inactive_12_mon All 1627 8500 10127 3 826 3020 3846 2 505 2777 3282 4 130 305 435 1 100 2133 2233 5 32 146 178 6 19 105 124 0 15 14 29 ------------------------------ Percentage % ------------------------------ Attrition_Flag 1 0 Months_Inactive_12_mon 0 0.517241 0.482759 4 0.298851 0.701149 3 0.214769 0.785231 5 0.179775 0.820225 All 0.160660 0.839340 2 0.153870 0.846130 6 0.153226 0.846774 1 0.044783 0.955217 ------------------------------------------------------------------------------------------------------------------------
------------------------------ Volume ------------------------------ Attrition_Flag 1 0 All Contacts_Count_12_mon All 1627 8500 10127 3 681 2699 3380 2 403 2824 3227 4 315 1077 1392 1 108 1391 1499 5 59 117 176 6 54 0 54 0 7 392 399 ------------------------------ Percentage % ------------------------------ Attrition_Flag 1 0 Contacts_Count_12_mon 6 1.000000 0.000000 5 0.335227 0.664773 4 0.226293 0.773707 3 0.201479 0.798521 All 0.160660 0.839340 2 0.124884 0.875116 1 0.072048 0.927952 0 0.017544 0.982456 ------------------------------------------------------------------------------------------------------------------------
------------------------------ Volume ------------------------------ Attrition_Flag 1 0 All AgeGroup All 1627 8500 10127 Less_than_50 772 3789 4561 Less_than_60 506 2492 2998 Less_than_40 261 1580 1841 Less_than_70 71 459 530 Less_than_30 17 178 195 Less_than_80 0 2 2 ------------------------------ Percentage % ------------------------------ Attrition_Flag 1 0 AgeGroup Less_than_50 0.169261 0.830739 Less_than_60 0.168779 0.831221 All 0.160660 0.839340 Less_than_40 0.141771 0.858229 Less_than_70 0.133962 0.866038 Less_than_30 0.087179 0.912821 Less_than_80 0.000000 1.000000 ------------------------------------------------------------------------------------------------------------------------
------------------------------ Volume ------------------------------ Attrition_Flag 1 0 All Months_on_book_Grp All 1627 8500 10127 Between_3-4_Year 922 4586 5508 Between_2-3_Year 469 2646 3115 Between_4-5_Year 138 679 817 Between_1-2_Year 98 589 687 ------------------------------ Percentage % ------------------------------ Attrition_Flag 1 0 Months_on_book_Grp Between_4-5_Year 0.168911 0.831089 Between_3-4_Year 0.167393 0.832607 All 0.160660 0.839340 Between_2-3_Year 0.150562 0.849438 Between_1-2_Year 0.142649 0.857351 ------------------------------------------------------------------------------------------------------------------------
------------------------------ Volume ------------------------------ Attrition_Flag 1 0 All Credit_Limit_Grp All 1627 8500 10127 <5K 926 4432 5358 Between_5K-10K 302 1713 2015 Between_10K-15K 145 796 941 >25K 141 751 892 Between_15K-20K 70 479 549 Between_20K-25K 43 329 372 ------------------------------ Percentage % ------------------------------ Attrition_Flag 1 0 Credit_Limit_Grp <5K 0.172826 0.827174 All 0.160660 0.839340 >25K 0.158072 0.841928 Between_10K-15K 0.154091 0.845909 Between_5K-10K 0.149876 0.850124 Between_15K-20K 0.127505 0.872495 Between_20K-25K 0.115591 0.884409 ------------------------------------------------------------------------------------------------------------------------
plt.figure(figsize=(15, 20))
line_columnnames=['Gender', 'Dependent_count', 'Education_Level',
'Marital_Status', 'Income_Category', 'Card_Category',
'Total_Relationship_Count', 'Months_Inactive_12_mon']
for i, variable in enumerate(line_columnnames):
plt.subplot(5, 2, i + 1)
sns.lineplot(x=variable, y="Attrition_Flag", data=df)
plt.title(variable)
plt.tight_layout()
plt.show()
Attrition_Flag vs Gender
Attrition_Flag vs Dependent_count
Attrition_Flag vs Education_Level
Attrition_Flag vs Marital_Status
Attrition_Flag vs Income_Category
Attrition_Flag vs Card_Category
Attrition_Flag vs Total_Relationship_Count
Attrition_Flag vs Months_Inactive_12_mon
Attrition_Flag vs Contacts_Count_12_mon
Attrition_Flag vs Age_Group
Attrition_Flag vs Months_on_book_Grp
Attrition_Flag vs Credit_Limit_Grp
plt.figure(figsize=(15, 25))
for i, variable in enumerate(number_columnNames):
    plt.subplot(5, 2, i + 1)
    sns.boxplot(x=df["Attrition_Flag"], y=df[variable], palette="PuBu", showfliers=False)
    plt.tight_layout()
    plt.title(variable)
plt.show()
plt.figure(figsize=(15, 20))
line_columnnames = ["Customer_Age", "Months_on_book", "Total_Trans_Ct"]
for i, variable in enumerate(line_columnnames):
    plt.subplot(5, 2, i + 1)
    sns.lineplot(x=variable, y="Attrition_Flag", data=df)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Attrition_Flag vs Customer Age
Attrition_Flag vs Months on Book
Attrition_Flag vs Total Revolving Bal
Attrition_Flag vs Total Trans Amount vs Total Trans Ct
Attrition_Flag vs Credit Limit
Attrition_Flag vs Avg. Open to Buy
Attrition_Flag vs Total Amt Change Q4-Q1 vs Total Ct Change Q4-Q1
Attrition_Flag vs Avg. Utilization Ratio
# Plotting a heatmap of the pairwise correlation matrix
correlation = df.corr()
plt.figure(figsize=(15, 7))
sns.heatmap(correlation, vmin=-1, vmax=1, annot=True, cmap="Spectral")
sns.pairplot(df, corner=True, hue="Attrition_Flag")
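Rather than reading the strongest relationships off the heatmap by eye, the top absolute pairwise correlations can be extracted programmatically. A minimal, self-contained sketch (the toy frame is illustrative, not the bank data):

```python
import numpy as np
import pandas as pd

def top_correlations(frame, n=5):
    """Return the n strongest absolute pairwise correlations among numeric columns."""
    corr = frame.select_dtypes("number").corr().abs()
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # keep each pair only once
    pairs = corr.where(mask).stack()  # MultiIndex (col_a, col_b) -> |r|
    return pairs.sort_values(ascending=False).head(n)

# toy frame with one perfectly correlated pair (b = 2 * a)
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 0, 1, 0]})
print(top_correlations(toy, n=2))
```

Applied to the notebook's `df`, this lists the same pairs discussed in the observations (e.g. Credit_Limit with Avg_Open_To_Buy).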
Observations:

- Customer_Age and Months_on_book have a high correlation, which is expected: as age increases, so does the length of the relationship with the bank.
- Avg_Open_To_Buy (the credit amount left to use on the card) has a negative correlation with Total_Revolving_Bal and Avg_Utilization_Ratio, as expected: when the revolving balance is high, less credit is left to spend.
- Avg_Open_To_Buy has a positive correlation with Credit_Limit: the higher the limit, the more credit remains available on the card.
- Credit_Limit has a negative correlation with Avg_Utilization_Ratio: with a low limit, most of the available credit gets used, so the utilization ratio is higher.
- Credit_Limit has a positive correlation with Total_Trans_Amt: a lower limit also caps the amount that can be spent.
- Total_Trans_Amt has a positive correlation with Avg_Open_To_Buy: the more credit available to spend, the higher the transaction amounts.
- Total_Trans_Amt has a positive correlation with Total_Trans_Ct: a higher count of transactions means a higher amount spent.
- Avg_Utilization_Ratio (the share of available credit spent) has a high correlation with Total_Revolving_Bal and a negative correlation with Avg_Open_To_Buy, as expected: high utilization raises the revolving balance, which in turn reduces the credit open to buy.
- Total_Ct_Chng_Q4_Q1 has a positive correlation with Total_Amt_Chng_Q4_Q1: as the change in transaction count grows, so does the change in amount spent.

Card_Category vs Income_Category
tab = pd.crosstab(df["Card_Category"], df["Income_Category"], normalize="index")
tab.plot(kind="bar", stacked=True)
plt.show()
Observations:
AgeGroup vs Months_on_book_Grp
tab = pd.crosstab(df["AgeGroup"], df["Months_on_book_Grp"], normalize="index")
tab.plot(kind="bar", stacked=True)
plt.show()
Observations:
Total_Trans_Amt vs Income_Category vs Attrition_Flag
plt.figure(figsize=(15, 7))
sns.boxplot(x="Total_Trans_Amt", y="Income_Category", data=df, hue="Attrition_Flag")
plt.show()
plt.figure(figsize=(15, 7))
sns.boxplot(x="Dependent_count", y="Total_Trans_Amt", data=df, hue="Attrition_Flag")
plt.show()
Observations:
Income Category vs Credit_Limit vs Customer Age vs Attrition_Flag
g = sns.FacetGrid(
df, col="Income_Category", hue="Attrition_Flag", col_wrap=4, margin_titles=True
)
g.map(sns.scatterplot, "Credit_Limit", "Customer_Age")
g.add_legend()
Observations:
Income Category vs Customer_Age vs Total_Trans_Amt vs Attrition_Flag
g = sns.FacetGrid(
df, col="Income_Category", hue="Attrition_Flag", col_wrap=4, margin_titles=True
)
g.map(sns.scatterplot, "Total_Trans_Amt", "Customer_Age")
g.add_legend()
Total_Trans_Amt vs Card_Category vs Attrition_Flag
g = sns.FacetGrid(
df, col="Card_Category", hue="Attrition_Flag", col_wrap=4, margin_titles=True
)
g.map(sns.scatterplot, "Credit_Limit", "Total_Trans_Amt")
g.add_legend()
plt.figure(figsize=(15, 7))
sns.boxplot(x="Card_Category", y="Total_Trans_Amt", data=df, hue="Attrition_Flag")
plt.show()
Observations
# Importing the metrics used by the helper functions below
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn_with_threshold(
    model, predictors, target, threshold=0.5
):
    """
    Function to compute different metrics, based on the threshold specified, to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    # predicted probability of class 1, converted to a label via the threshold
    pred_prob = model.predict_proba(predictors)[:, 1]
    pred = pred_prob > threshold

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )
    return df_perf
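The `threshold` argument matters because lowering it can only turn negative predictions into positives, so recall never decreases as the threshold drops (while precision typically does). A small demonstration on synthetic data (not the bank data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# imbalanced toy problem, roughly like the ~16% attrition rate
X_demo, y_demo = make_classification(
    n_samples=500, weights=[0.85, 0.15], random_state=1
)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo)[:, 1]

for thr in (0.5, 0.3, 0.1):
    pred = proba > thr
    print(
        "threshold=%.1f  recall=%.3f  precision=%.3f"
        % (thr, recall_score(y_demo, pred), precision_score(y_demo, pred))
    )
```

Since attrition (class 1) is the costly outcome for the bank, a lower threshold trades some precision for catching more churners.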
# defining a function to plot the confusion_matrix of a classification model built using sklearn
def confusion_matrix_sklearn_with_threshold(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix, based on the threshold specified, with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    # predicted probability of class 1, converted to a label via the threshold
    pred_prob = model.predict_proba(predictors)[:, 1]
    y_pred = pred_prob > threshold

    cm = confusion_matrix(target, y_pred)
    # annotate each cell with its count and its percentage of all observations
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Attrition_Flag            10127 non-null  category
 1   Customer_Age              10127 non-null  int64
 2   Gender                    10127 non-null  category
 3   Dependent_count           10127 non-null  category
 4   Education_Level           8608 non-null   category
 5   Marital_Status            9378 non-null   category
 6   Income_Category           9015 non-null   category
 7   Card_Category             10127 non-null  category
 8   Months_on_book            10127 non-null  int64
 9   Total_Relationship_Count  10127 non-null  category
 10  Months_Inactive_12_mon    10127 non-null  category
 11  Contacts_Count_12_mon     10127 non-null  category
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64
 17  Total_Trans_Ct            10127 non-null  int64
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
 20  AgeGroup                  10127 non-null  category
 21  Months_on_book_Grp        10127 non-null  category
 22  Credit_Limit_Grp          10127 non-null  category
dtypes: category(13), float64(5), int64(5)
memory usage: 922.6 KB
X = df.drop(["Attrition_Flag"], axis=1)
y = df["Attrition_Flag"]
# Dropping the derived group columns and the highly correlated ratio/change columns,
# since they will not help in predicting customer attrition
X.drop(
    [
        "AgeGroup",
        "Months_on_book_Grp",
        "Credit_Limit_Grp",
        "Avg_Utilization_Ratio",
        "Total_Ct_Chng_Q4_Q1",
        "Total_Amt_Chng_Q4_Q1",
    ],
    axis=1,
    inplace=True,
)
# Splitting data into training, validation and test sets:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 16) (2026, 16) (2026, 16)
# Let's impute the missing values
imp_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
cols_to_impute = ["Education_Level", "Marital_Status", "Income_Category"]
# fit and transform the imputer on train data
X_train[cols_to_impute] = imp_mode.fit_transform(X_train[cols_to_impute])
# Transform the validation data using the imputer fitted on train
X_val[cols_to_impute] = imp_mode.transform(X_val[cols_to_impute])
# Transform the test data using the imputer fitted on train
X_test[cols_to_impute] = imp_mode.transform(X_test[cols_to_impute])
# Checking that no column has missing values in train or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Trans_Amt             0
Total_Trans_Ct              0
dtype: int64
------------------------------
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Trans_Amt             0
Total_Trans_Ct              0
dtype: int64
# Creating dummy variables for categorical variables
X_train = pd.get_dummies(data=X_train, drop_first=True)
X_val = pd.get_dummies(data=X_val, drop_first=True)
X_test = pd.get_dummies(data=X_test, drop_first=True)
print("Shape of X Training set : ", X_train.shape)
print("Shape of X validation set : ", X_val.shape)
print("Shape of X test set : ", X_test.shape)
print("")
print("Shape of Y Training set : ", y_train.shape)
print("Shape of Y validation set : ", y_val.shape)
print("Shape of Y test set : ", y_test.shape)
print("")
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("")
print("Percentage of classes in validation set:")
print(y_val.value_counts(normalize=True))
print("")
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of X Training set :  (6075, 44)
Shape of X validation set :  (2026, 44)
Shape of X test set :  (2026, 44)

Shape of Y Training set :  (6075,)
Shape of Y validation set :  (2026,)
Shape of Y test set :  (2026,)

Percentage of classes in training set:
0    0.839342
1    0.160658
Name: Attrition_Flag, dtype: float64

Percentage of classes in validation set:
0    0.839092
1    0.160908
Name: Attrition_Flag, dtype: float64

Percentage of classes in test set:
0    0.839585
1    0.160415
Name: Attrition_Flag, dtype: float64
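One caveat with calling `pd.get_dummies` separately on each split: a category that never occurs in a given split silently produces fewer columns there. The shapes above happen to agree (44 columns each), but the defensive fix is to reindex the validation/test dummies to the training columns. A toy sketch (the `color` frame is illustrative):

```python
import pandas as pd

train = pd.DataFrame({"color": ["blue", "green", "red"]})
test = pd.DataFrame({"color": ["red", "blue"]})  # 'green' never occurs in test

train_d = pd.get_dummies(train, drop_first=True)  # columns: color_green, color_red
test_d = pd.get_dummies(test, drop_first=True)    # column: color_red only
# align test to the training columns, filling the missing dummy with 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_d.columns))  # ['color_green', 'color_red']
```

The same `reindex(columns=X_train.columns, fill_value=0)` pattern would apply to `X_val` and `X_test` here.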
X_test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2026 entries, 9760 to 413
Data columns (total 44 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   Customer_Age                    2026 non-null   int64
 1   Months_on_book                  2026 non-null   int64
 2   Credit_Limit                    2026 non-null   float64
 3   Total_Revolving_Bal             2026 non-null   int64
 4   Avg_Open_To_Buy                 2026 non-null   float64
 5   Total_Trans_Amt                 2026 non-null   int64
 6   Total_Trans_Ct                  2026 non-null   int64
 7   Gender_M                        2026 non-null   uint8
 8   Dependent_count_1               2026 non-null   uint8
 9   Dependent_count_2               2026 non-null   uint8
 10  Dependent_count_3               2026 non-null   uint8
 11  Dependent_count_4               2026 non-null   uint8
 12  Dependent_count_5               2026 non-null   uint8
 13  Education_Level_Doctorate       2026 non-null   uint8
 14  Education_Level_Graduate        2026 non-null   uint8
 15  Education_Level_High School     2026 non-null   uint8
 16  Education_Level_Post-Graduate   2026 non-null   uint8
 17  Education_Level_Uneducated      2026 non-null   uint8
 18  Marital_Status_Married          2026 non-null   uint8
 19  Marital_Status_Single           2026 non-null   uint8
 20  Income_Category_$40K - $60K     2026 non-null   uint8
 21  Income_Category_$60K - $80K     2026 non-null   uint8
 22  Income_Category_$80K - $120K    2026 non-null   uint8
 23  Income_Category_Less than $40K  2026 non-null   uint8
 24  Card_Category_Gold              2026 non-null   uint8
 25  Card_Category_Platinum          2026 non-null   uint8
 26  Card_Category_Silver            2026 non-null   uint8
 27  Total_Relationship_Count_2      2026 non-null   uint8
 28  Total_Relationship_Count_3      2026 non-null   uint8
 29  Total_Relationship_Count_4      2026 non-null   uint8
 30  Total_Relationship_Count_5      2026 non-null   uint8
 31  Total_Relationship_Count_6      2026 non-null   uint8
 32  Months_Inactive_12_mon_1        2026 non-null   uint8
 33  Months_Inactive_12_mon_2        2026 non-null   uint8
 34  Months_Inactive_12_mon_3        2026 non-null   uint8
 35  Months_Inactive_12_mon_4        2026 non-null   uint8
 36  Months_Inactive_12_mon_5        2026 non-null   uint8
 37  Months_Inactive_12_mon_6        2026 non-null   uint8
 38  Contacts_Count_12_mon_1         2026 non-null   uint8
 39  Contacts_Count_12_mon_2         2026 non-null   uint8
 40  Contacts_Count_12_mon_3         2026 non-null   uint8
 41  Contacts_Count_12_mon_4         2026 non-null   uint8
 42  Contacts_Count_12_mon_5         2026 non-null   uint8
 43  Contacts_Count_12_mon_6         2026 non-null   uint8
dtypes: float64(2), int64(5), uint8(37)
memory usage: 199.8 KB
models_list = {"Name": [], "CV_Score": [], "Model": []}
# Dictionary to store each model's name, CV score and fitted estimator
models1 = []  # Empty list to store all the models
# Appending models into the list
models1.append(
("Logistic Regression", LogisticRegression(solver="liblinear", random_state=1))
)
models1.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))
models1.append(("Bagging", BaggingClassifier(random_state=1)))
models1.append(("Random Forest", RandomForestClassifier(random_state=1)))
models1.append(("Gradient Boost", GradientBoostingClassifier(random_state=1)))
models1.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models1.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
score1 = []
# loop through all models to get the mean cross validated score
print("\nCross-Validation Performance:\n")
for name, model in models1:
    scoring = "recall"
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # 5 splits
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    models_list["Name"].append(name)
    models_list["CV_Score"].append(cv_result.mean() * 100)
    models_list["Model"].append(model)
    print("{}: {}".format(name, cv_result.mean() * 100))

print("\nRecall Score - Validation Performance:\n")
for name, model in models1:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val)) * 100
    score1.append(scores)
    print("{}: {}".format(name, scores))
Cross-Validation Performance:

Logistic Regression: 32.38147566718995
Decision Tree: 74.17687074829932
Bagging: 73.04918890633176
Random Forest: 65.36630036630036
Gradient Boost: 78.06750392464677
Adaboost: 76.9403453689168
Xgboost: 79.91313448456306

Recall Score - Validation Performance:

Logistic Regression: 50.920245398773
Decision Tree: 76.38036809815951
Bagging: 76.38036809815951
Random Forest: 69.01840490797547
Gradient Boost: 79.4478527607362
Adaboost: 80.98159509202453
Xgboost: 83.74233128834356
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(15, 7))
fig.suptitle("Algorithm Comparison - Original Data")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
# Fit SMOTE (Synthetic Minority Oversampling Technique) on the train data
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("Before OverSampling, count of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, count of label '0': {} \n".format(sum(y_train == 0)))
print("After OverSampling, count of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, count of label '0': {} \n".format(sum(y_train_over == 0)))
print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before OverSampling, count of label '1': 976
Before OverSampling, count of label '0': 5099

After OverSampling, count of label '1': 5099
After OverSampling, count of label '0': 5099

After OverSampling, the shape of train_X: (10198, 44)
After OverSampling, the shape of train_y: (10198,)
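Unlike plain duplication, SMOTE synthesizes new minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The core step looks roughly like this in NumPy (an illustration of the idea, not imblearn's actual implementation):

```python
import numpy as np

def smote_like_point(X_minority, rng):
    """Create one synthetic sample between a random minority point and its nearest neighbour."""
    i = rng.integers(len(X_minority))
    x = X_minority[i]
    others = np.delete(X_minority, i, axis=0)
    # nearest minority neighbour by Euclidean distance
    nearest = others[np.argmin(np.linalg.norm(others - x, axis=1))]
    gap = rng.random()  # uniform in [0, 1): where on the segment the new point lands
    return x + gap * (nearest - x)

rng = np.random.default_rng(1)
minority = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 5.0]])
synthetic = smote_like_point(minority, rng)
print(synthetic)  # a point on the segment joining a minority sample and its neighbour
```

This is why SMOTE works better than naive oversampling here: the classifier sees plausible new minority points instead of exact copies.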
models2 = [] # Empty list to store all the models
# Appending models into the list
models2.append(
("Over Logistic Regression", LogisticRegression(solver="liblinear", random_state=1))
)
models2.append(("Over Decision Tree", DecisionTreeClassifier(random_state=1)))
models2.append(("Over Bagging", BaggingClassifier(random_state=1)))
models2.append(("Over Random Forest", RandomForestClassifier(random_state=1)))
models2.append(("Over Gradient Boost", GradientBoostingClassifier(random_state=1)))
models2.append(("Over Adaboost", AdaBoostClassifier(random_state=1)))
models2.append(("Over Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
score2 = []
# loop through all models to get the mean cross validated score
print("\nCross-Validation Performance:\n")
for name, model in models2:
    scoring = "recall"
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # 5 splits
    cv_result = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    models_list["Name"].append(name)
    models_list["CV_Score"].append(cv_result.mean() * 100)
    models_list["Model"].append(model)
    print("{}: {}".format(name, cv_result.mean() * 100))

print("\nRecall Score - Validation Performance:\n")
for name, model in models2:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_val, model.predict(X_val)) * 100
    score2.append(scores)
    print("{}: {}".format(name, scores))
Cross-Validation Performance:

Over Logistic Regression: 90.15520791240932
Over Decision Tree: 94.31268640920548
Over Bagging: 95.15626623564047
Over Random Forest: 95.74448228751756
Over Gradient Boost: 96.33281379283804
Over Adaboost: 96.1562469933999
Over Xgboost: 96.70532432796473

Recall Score - Validation Performance:

Over Logistic Regression: 58.58895705521472
Over Decision Tree: 76.68711656441718
Over Bagging: 78.52760736196319
Over Random Forest: 72.39263803680981
Over Gradient Boost: 85.2760736196319
Over Adaboost: 82.82208588957054
Over Xgboost: 85.58282208588957
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(15, 7))
fig.suptitle("Algorithm Comparison - Oversampling Data")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
# fit random under sampler on the train data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, count of label '1': {}".format(sum(y_train == 1)))
print("Before Under Sampling, count of label '0': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, count of label '1': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, count of label '0': {} \n".format(sum(y_train_un == 0)))
print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, count of label '1': 976
Before Under Sampling, count of label '0': 5099

After Under Sampling, count of label '1': 976
After Under Sampling, count of label '0': 976

After Under Sampling, the shape of train_X: (1952, 44)
After Under Sampling, the shape of train_y: (1952,)
models3 = [] # Empty list to store all the models
# Appending models into the list
models3.append(
(
"Under Logistic Regression",
LogisticRegression(solver="liblinear", random_state=1),
)
)
models3.append(("Under Decision Tree", DecisionTreeClassifier(random_state=1)))
models3.append(("Under Bagging", BaggingClassifier(random_state=1)))
models3.append(("Under Random Forest", RandomForestClassifier(random_state=1)))
models3.append(("Under Gradient Boost", GradientBoostingClassifier(random_state=1)))
models3.append(("Under Adaboost", AdaBoostClassifier(random_state=1)))
models3.append(("Under Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
score3 = []
# loop through all models to get the mean cross validated score
print("\nCross-Validation Performance:\n")
for name, model in models3:
    scoring = "recall"
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # 5 splits
    cv_result = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    models_list["Name"].append(name)
    models_list["CV_Score"].append(cv_result.mean() * 100)
    models_list["Model"].append(model)
    print("{}: {}".format(name, cv_result.mean() * 100))

print("\nRecall Score - Validation Performance:\n")
for name, model in models3:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_val, model.predict(X_val)) * 100
    score3.append(scores)
    print("{}: {}".format(name, scores))
Cross-Validation Performance:

Under Logistic Regression: 83.91732077446363
Under Decision Tree: 87.09105180533753
Under Bagging: 89.03976975405548
Under Random Forest: 90.98482469911042
Under Gradient Boost: 92.72736787022501
Under Adaboost: 91.70329670329672
Under Xgboost: 92.31711145996861

Recall Score - Validation Performance:

Under Logistic Regression: 84.66257668711657
Under Decision Tree: 89.2638036809816
Under Bagging: 89.87730061349694
Under Random Forest: 92.33128834355828
Under Gradient Boost: 91.71779141104295
Under Adaboost: 92.02453987730061
Under Xgboost: 93.86503067484662
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(15, 7))
fig.suptitle("Algorithm Comparison - Undersampling Data")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
Picking the 3 best models from the 7 x 3 matrix (original, oversampled and undersampled training sets)
models_df = pd.DataFrame(models_list)
models_df.sort_values(by=["CV_Score"], ascending=False)
|   | Name | CV_Score | Model |
|---|---|---|---|
| 13 | Over Xgboost | 96.705324 | XGBClassifier(base_score=0.5, booster='gbtree'... |
| 11 | Over Gradient Boost | 96.332814 | ([DecisionTreeRegressor(criterion='friedman_ms... |
| 12 | Over Adaboost | 96.156247 | (DecisionTreeClassifier(max_depth=1, random_st... |
| 10 | Over Random Forest | 95.744482 | (DecisionTreeClassifier(max_features='auto', r... |
| 9 | Over Bagging | 95.156266 | (DecisionTreeClassifier(random_state=102886208... |
| 8 | Over Decision Tree | 94.312686 | DecisionTreeClassifier(random_state=1) |
| 18 | Under Gradient Boost | 92.727368 | ([DecisionTreeRegressor(criterion='friedman_ms... |
| 20 | Under Xgboost | 92.317111 | XGBClassifier(base_score=0.5, booster='gbtree'... |
| 19 | Under Adaboost | 91.703297 | (DecisionTreeClassifier(max_depth=1, random_st... |
| 17 | Under Random Forest | 90.984825 | (DecisionTreeClassifier(max_features='auto', r... |
| 7 | Over Logistic Regression | 90.155208 | LogisticRegression(random_state=1, solver='lib... |
| 16 | Under Bagging | 89.039770 | (DecisionTreeClassifier(random_state=102886208... |
| 15 | Under Decision Tree | 87.091052 | DecisionTreeClassifier(random_state=1) |
| 14 | Under Logistic Regression | 83.917321 | LogisticRegression(random_state=1, solver='lib... |
| 6 | Xgboost | 79.913134 | XGBClassifier(base_score=0.5, booster='gbtree'... |
| 4 | Gradient Boost | 78.067504 | ([DecisionTreeRegressor(criterion='friedman_ms... |
| 5 | Adaboost | 76.940345 | (DecisionTreeClassifier(max_depth=1, random_st... |
| 1 | Decision Tree | 74.176871 | DecisionTreeClassifier(random_state=1) |
| 2 | Bagging | 73.049189 | (DecisionTreeClassifier(random_state=102886208... |
| 3 | Random Forest | 65.366300 | (DecisionTreeClassifier(max_features='auto', r... |
| 0 | Logistic Regression | 32.381476 | LogisticRegression(random_state=1, solver='lib... |
Based on the above comparison, the following models have the best cross-validated recall scores, in decreasing order:
1. "Over XG Boost" (96.71 %)
2. "Over Gradient Boost" (96.33 %)
3. "Over AdaBoost" (96.16 %)
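The shortlist can be read straight off `models_df`; a minimal sketch with the scores above hard-coded for illustration:

```python
import pandas as pd

models_demo = pd.DataFrame(
    {
        "Name": ["Over Xgboost", "Over Gradient Boost", "Over Adaboost", "Over Bagging"],
        "CV_Score": [96.71, 96.33, 96.16, 95.16],
    }
)
top3 = models_demo.nlargest(3, "CV_Score")  # equivalent to sort_values(...).head(3)
print(top3["Name"].tolist())  # ['Over Xgboost', 'Over Gradient Boost', 'Over Adaboost']
```

On the real `models_df`, `models_df.nlargest(3, "CV_Score")` yields the same three oversampled boosting models tuned below.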
Grid Search - AdaBoost
%%time
# defining model
model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in GridSearchCV
param_grid = {
"n_estimators": np.arange(10, 110, 10),
"learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs=-1)
# Fitting parameters in GridSearchCV
grid_cv.fit(X_train_over, y_train_over)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
# Set the clf to the best combination of parameters
adb_tuned1 = grid_cv.best_estimator_
# Fit the model on training data
adb_tuned1.fit(X_train_over, y_train_over)
Best Parameters:{'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1), 'learning_rate': 0.05, 'n_estimators': 30}
Score: 0.9321518597625508
Wall time: 3min 18s
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.05, n_estimators=30, random_state=1)
Checking model performance
# Calculating different metrics on train set
Adaboost_grid_train = model_performance_classification_sklearn_with_threshold(
adb_tuned1, X_train_over, y_train_over
)
print("Training performance:")
print(Adaboost_grid_train)
print("*************************************")
Adaboost_grid_val = model_performance_classification_sklearn_with_threshold(
adb_tuned1, X_val, y_val
)
print("Validation performance:")
print(Adaboost_grid_val)
Training performance:
   Accuracy    Recall  Precision        F1
0  0.939498  0.963718   0.919192  0.940929
*************************************
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.911155  0.871166   0.672986  0.759358
Observations:
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(adb_tuned1, X_train_over, y_train_over)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(adb_tuned1, X_val, y_val)
Randomized Search - AdaBoost
%%time
# defining model
model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
"n_estimators": np.arange(10, 110, 10),
"learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_grid,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
    n_jobs=-1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)
print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)
# Set the clf to the best combination of parameters
adb_tuned2 = randomized_cv.best_estimator_
# Fit the model on training data
adb_tuned2.fit(X_train_over, y_train_over)
Best parameters are {'n_estimators': 10, 'learning_rate': 0.2, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.9305853489580326:
Wall time: 1min 7s
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.2, n_estimators=10, random_state=1)
Checking model performance
# Calculating different metrics on train set
Adaboost_random_train = model_performance_classification_sklearn_with_threshold(
adb_tuned2, X_train_over, y_train_over
)
print("Training performance:")
print(Adaboost_random_train)
print("*************************************")
Adaboost_random_val = model_performance_classification_sklearn_with_threshold(
adb_tuned2, X_val, y_val
)
print("Validation performance:")
print(Adaboost_random_val)
Training performance:
   Accuracy    Recall  Precision        F1
0   0.94695  0.964699   0.931629  0.947876
*************************************
Validation performance:
   Accuracy    Recall  Precision       F1
0  0.916091  0.855828    0.69403  0.766484
Observations:
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(adb_tuned2, X_train_over, y_train_over)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(adb_tuned2, X_val, y_val)
Grid Search - Gradient Boost
%%time
# defining model
model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in GridSearchCV
param_grid = {
    "n_estimators": [100, 150, 200, 250],
    "subsample": [0.8, 0.9, 1],
    "max_features": [0.7, 0.8, 0.9, 1],
    "max_depth": [3, 5, 7],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs=-1)
# Fitting parameters in GridSearchCV
grid_cv.fit(X_train_over, y_train_over)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
# Set the clf to the best combination of parameters
gb_tuned1 = grid_cv.best_estimator_
# Fit the model on training data
gb_tuned1.fit(X_train_over, y_train_over)
Best Parameters:{'max_depth': 7, 'max_features': 0.9, 'n_estimators': 100, 'subsample': 1}
Score: 0.9149015759395024
Wall time: 13min 58s
GradientBoostingClassifier(max_depth=7, max_features=0.9, random_state=1,
subsample=1)
Checking model performance
# Calculating different metrics on train set
GB_grid_train = model_performance_classification_sklearn_with_threshold(
gb_tuned1, X_train_over, y_train_over
)
print("Training performance:")
print(GB_grid_train)
print("*************************************")
GB_grid_val = model_performance_classification_sklearn_with_threshold(
gb_tuned1, X_val, y_val
)
print("Validation performance:")
print(GB_grid_val)
Training performance:
   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
*************************************
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.950642  0.849693   0.844512  0.847095
Observations:
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(gb_tuned1, X_train_over, y_train_over)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(gb_tuned1, X_val, y_val)
Randomized Search - Gradient Boost
%%time
# defining model
model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": [100, 150, 200, 250],
    "subsample": [0.8, 0.9, 1],
    "max_features": [0.7, 0.8, 0.9, 1],
    "max_depth": [3, 5, 7],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
# Set the clf to the best combination of parameters
gb_tuned2 = randomized_cv.best_estimator_
# Fit the model on training data
gb_tuned2.fit(X_train_over, y_train_over)
Best parameters are {'subsample': 1, 'n_estimators': 100, 'max_features': 0.7, 'max_depth': 7} with CV score=0.9129405992033712:
Wall time: 4min 18s
GradientBoostingClassifier(max_depth=7, max_features=0.7, random_state=1,
subsample=1)
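The wall-time gap (about 4 minutes versus about 14 for grid search, at nearly the same CV score) follows from the search sizes: the grid above has 4 × 3 × 4 × 3 = 144 combinations, of which randomized search samples only `n_iter=50`. This can be checked directly with sklearn's parameter iterators:

```python
# Exhaustive vs. sampled search-space sizes for the grid used above
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_grid = {
    "n_estimators": [100, 150, 200, 250],
    "subsample": [0.8, 0.9, 1],
    "max_features": [0.7, 0.8, 0.9, 1],
    "max_depth": [3, 5, 7],
}
print(len(ParameterGrid(param_grid)))  # 144 combinations tried by GridSearchCV
sampled = list(ParameterSampler(param_grid, n_iter=50, random_state=1))
print(len(sampled))  # 50 combinations tried by RandomizedSearchCV
```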
Checking model performance
# Calculating different metrics on train set
GB_random_train = model_performance_classification_sklearn_with_threshold(
gb_tuned2, X_train_over, y_train_over
)
print("Training performance:")
print(GB_random_train)
print("*************************************")
GB_random_val = model_performance_classification_sklearn_with_threshold(
gb_tuned2, X_val, y_val
)
print("Validation performance:")
print(GB_random_val)
Training performance:
   Accuracy    Recall  Precision        F1
0  0.999902  0.999804        1.0  0.999902
*************************************
Validation performance:
   Accuracy    Recall  Precision        F1
0  0.950148  0.837423   0.850467  0.843895
Observations:
- Training metrics are again near-perfect while validation recall is ~0.84, so this model is also overfitting; randomized search reached almost the same CV score as grid search in under a third of the wall time.
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(gb_tuned2, X_train_over, y_train_over)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(gb_tuned2, X_val, y_val)
Grid Search
%%time
#defining model
model = XGBClassifier(random_state=1,eval_metric='logloss')
#Parameter grid to pass in GridSearchCV
param_grid = {
    "n_estimators": np.arange(50, 150, 50),
    "scale_pos_weight": [5, 10],
    "learning_rate": [0.01, 0.1],
    "gamma": [0, 1, 3],
    "subsample": [0.8, 0.9, 1],
    "max_depth": np.arange(1, 5, 1),
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1, verbose= 2)
#Fitting parameters in GridSearchCV
grid_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(grid_cv.best_params_,grid_cv.best_score_))
xgb_tuned1 = grid_cv.best_estimator_
xgb_tuned1.fit(X_train_over, y_train_over)
Fitting 5 folds for each of 288 candidates, totalling 1440 fits
Best parameters are {'gamma': 0, 'learning_rate': 0.01, 'max_depth': 2, 'n_estimators': 100, 'scale_pos_weight': 10, 'subsample': 0.8} with CV score=0.9986274509803922:
Wall time: 10min 5s
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
early_stopping_rounds=None, enable_categorical=False,
eval_metric='logloss', gamma=0, gpu_id=-1,
grow_policy='depthwise', importance_type=None,
interaction_constraints='', learning_rate=0.01, max_bin=256,
max_cat_to_onehot=4, max_delta_step=0, max_depth=2, max_leaves=0,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=0, num_parallel_tree=1, predictor='auto',
random_state=1, reg_alpha=0, reg_lambda=1, ...)
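`scale_pos_weight` re-weights the positive class in XGBoost's loss; a common heuristic (an assumption here, not taken from this notebook) sets it to count(negative) / count(positive), which is why values like 5 and 10 appear in the grid for an imbalanced target:

```python
# Heuristic value for scale_pos_weight on toy imbalanced labels (illustrative only)
import numpy as np

y = np.array([0] * 90 + [1] * 10)  # 90 negatives, 10 positives
neg, pos = np.bincount(y)
print(neg / pos)  # -> 9.0
```

Note that the notebook tunes `scale_pos_weight` on `X_train_over`, which has already been oversampled, so the effective positive-class weight is compounded.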
# Calculating different metrics on train set
xgboost_grid_train = model_performance_classification_sklearn_with_threshold(
xgb_tuned1, X_train_over, y_train_over
)
print("Training performance:")
print(xgboost_grid_train)
print("*************************************")
xgboost_grid_val = model_performance_classification_sklearn_with_threshold(
xgb_tuned1, X_val, y_val
)
print("Validation performance:")
print(xgboost_grid_val)
Training performance:
   Accuracy    Recall  Precision        F1
0  0.707197  0.999804   0.630706  0.773479
*************************************
Validation performance:
   Accuracy  Recall  Precision        F1
0  0.530109     1.0   0.255086  0.406484
Observations:
- Recall is ~1.0 on both train and validation, but precision and accuracy are low; the model over-predicts attrition, likely because scale_pos_weight=10 further re-weights a positive class that was already oversampled.
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(xgb_tuned1, X_train_over, y_train_over)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(xgb_tuned1, X_val, y_val)
Randomized Search
%%time
# defining model
model = XGBClassifier(random_state=1,eval_metric='logloss')
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50, 150, 50),
    "scale_pos_weight": [5, 10],
    "learning_rate": [0.01, 0.1],
    "gamma": [0, 1, 3],
    "subsample": [0.8, 0.9, 1],
    "max_depth": np.arange(1, 5, 1),
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
xgb_tuned2 = randomized_cv.best_estimator_
xgb_tuned2.fit(X_train_over, y_train_over)
Best parameters are {'subsample': 1, 'scale_pos_weight': 10, 'n_estimators': 50, 'max_depth': 1, 'learning_rate': 0.1, 'gamma': 3} with CV score=0.9972549019607844:
Wall time: 1min 37s
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
early_stopping_rounds=None, enable_categorical=False,
eval_metric='logloss', gamma=3, gpu_id=-1,
grow_policy='depthwise', importance_type=None,
interaction_constraints='', learning_rate=0.1, max_bin=256,
max_cat_to_onehot=4, max_delta_step=0, max_depth=1, max_leaves=0,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=50, n_jobs=0, num_parallel_tree=1, predictor='auto',
random_state=1, reg_alpha=0, reg_lambda=1, ...)
# Calculating different metrics on train set
xgboost_random_train = model_performance_classification_sklearn_with_threshold(
xgb_tuned2, X_train_over, y_train_over
)
print("Training performance:")
print(xgboost_random_train)
print("*************************************")
xgboost_random_val = model_performance_classification_sklearn_with_threshold(
xgb_tuned2, X_val, y_val
)
print("Validation performance:")
print(xgboost_random_val)
Training performance:
   Accuracy    Recall  Precision        F1
0  0.651696  0.998431   0.589577  0.741372
*************************************
Validation performance:
   Accuracy   Recall  Precision        F1
0  0.405726  0.98773   0.211564  0.348485
Observations:
- Validation recall stays near 0.99 but precision drops to ~0.21 and accuracy to ~0.41; like the grid-searched XGBoost model, this model heavily over-predicts the positive class.
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(xgb_tuned2, X_train_over, y_train_over)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(xgb_tuned2, X_val, y_val)
# training performance comparison
models_train_comp_df = pd.concat(
[
Adaboost_grid_train.T,
Adaboost_random_train.T,
GB_grid_train.T,
GB_random_train.T,
xgboost_grid_train.T,
xgboost_random_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Adaboost Grid",
"Adaboost Random",
"Gradient Grid",
"Gradient Random",
"XGBoost Grid",
"XGBoost Random",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Adaboost Grid | Adaboost Random | Gradient Grid | Gradient Random | XGBoost Grid | XGBoost Random |
|---|---|---|---|---|---|---|
| Accuracy | 0.939498 | 0.946950 | 1.0 | 0.999902 | 0.707197 | 0.651696 |
| Recall | 0.963718 | 0.964699 | 1.0 | 0.999804 | 0.999804 | 0.998431 |
| Precision | 0.919192 | 0.931629 | 1.0 | 1.000000 | 0.630706 | 0.589577 |
| F1 | 0.940929 | 0.947876 | 1.0 | 0.999902 | 0.773479 | 0.741372 |
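The comparison frames above are built by transposing each model's one-row metrics frame and concatenating column-wise, so metrics become rows and models become columns; a toy sketch of the pattern (with made-up numbers):

```python
# Each model's metrics start as a one-row DataFrame; transpose and
# concatenate along axis=1 to get metrics as rows, models as columns.
import pandas as pd

a = pd.DataFrame({"Accuracy": [0.94], "Recall": [0.96]})
b = pd.DataFrame({"Accuracy": [0.95], "Recall": [0.85]})
comp = pd.concat([a.T, b.T], axis=1)
comp.columns = ["Model A", "Model B"]
print(comp)
```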
# Validation performance comparison
models_train_comp_df = pd.concat(
[
Adaboost_grid_val.T,
Adaboost_random_val.T,
GB_grid_val.T,
GB_random_val.T,
xgboost_grid_val.T,
xgboost_random_val.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Adaboost Grid",
"Adaboost Random",
"Gradient Grid",
"Gradient Random",
"XGBoost Grid",
"XGBoost Random",
]
print("Validation performance comparison:")
models_train_comp_df
Validation performance comparison:
| | Adaboost Grid | Adaboost Random | Gradient Grid | Gradient Random | XGBoost Grid | XGBoost Random |
|---|---|---|---|---|---|---|
| Accuracy | 0.911155 | 0.916091 | 0.950642 | 0.950148 | 0.530109 | 0.405726 |
| Recall | 0.871166 | 0.855828 | 0.849693 | 0.837423 | 1.000000 | 0.987730 |
| Precision | 0.672986 | 0.694030 | 0.844512 | 0.850467 | 0.255086 | 0.211564 |
| F1 | 0.759358 | 0.766484 | 0.847095 | 0.843895 | 0.406484 | 0.348485 |
Observations:
On comparing the Recall metric of the training scores vs the validation scores, we infer:
- AdaBoost Grid & AdaBoost Random validation metrics generalize well from the training data. Comparing the accuracy metric as well, AdaBoost Random has the better accuracy.
#### We will consider AdaBoost Random the best model, as it generalizes without overfitting and has good accuracy compared with the other models
# Calculating different metrics on the test set
AdaBoost_Grid_test = model_performance_classification_sklearn_with_threshold(
adb_tuned1, X_test, y_test
)
# Calculating different metrics on the test set
AdaBoost_Random_test = model_performance_classification_sklearn_with_threshold(
adb_tuned2, X_test, y_test
)
# Calculating different metrics on the test set
GB_Grid_test = model_performance_classification_sklearn_with_threshold(
gb_tuned1, X_test, y_test
)
# Calculating different metrics on the test set
GB_Random_test = model_performance_classification_sklearn_with_threshold(
gb_tuned2, X_test, y_test
)
# Calculating different metrics on the test set
XGB_Grid_test = model_performance_classification_sklearn_with_threshold(
xgb_tuned1, X_test, y_test
)
# Calculating different metrics on the test set
XGB_Random_test = model_performance_classification_sklearn_with_threshold(
xgb_tuned2, X_test, y_test
)
# training performance comparison
models_train_comp_df = pd.concat(
[
Adaboost_grid_train.T,
Adaboost_random_train.T,
GB_grid_train.T,
GB_random_train.T,
xgboost_grid_train.T,
xgboost_random_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Adaboost Grid",
"Adaboost Random",
"Gradient Grid",
"Gradient Random",
"XGBoost Grid",
"XGBoost Random",
]
print("Training performance comparison:")
print(models_train_comp_df)
# Testing performance comparison
models_train_comp_df = pd.concat(
[
AdaBoost_Grid_test.T,
AdaBoost_Random_test.T,
GB_Grid_test.T,
GB_Random_test.T,
XGB_Grid_test.T,
XGB_Random_test.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Adaboost Grid",
"Adaboost Random",
"Gradient Grid",
"Gradient Random",
"XGBoost Grid",
"XGBoost Random",
]
print("\n\n")
print("Test performance comparison:")
print(models_train_comp_df)
Training performance comparison:
Adaboost Grid Adaboost Random Gradient Grid Gradient Random \
Accuracy 0.939498 0.946950 1.0 0.999902
Recall 0.963718 0.964699 1.0 0.999804
Precision 0.919192 0.931629 1.0 1.000000
F1 0.940929 0.947876 1.0 0.999902
XGBoost Grid XGBoost Random
Accuracy 0.707197 0.651696
Recall 0.999804 0.998431
Precision 0.630706 0.589577
F1 0.773479 0.741372
Test performance comparison:
Adaboost Grid Adaboost Random Gradient Grid Gradient Random \
Accuracy 0.905726 0.917078 0.958539 0.960020
Recall 0.898462 0.889231 0.886154 0.883077
Precision 0.648889 0.686461 0.859701 0.869697
F1 0.753548 0.774799 0.872727 0.876336
XGBoost Grid XGBoost Random
Accuracy 0.496545 0.394867
Recall 1.000000 0.996923
Precision 0.241636 0.209167
F1 0.389222 0.345784
Observations:
- XGBoost Grid & XGBoost Random: accuracy drops sharply on the test data compared with the training data, showing that these models are overfitting even though they have the highest recall scores
- Gradient Grid: the model is overfitting the data
- Gradient Random: the model does not generalize well to the test data and is close to overfitting
- AdaBoost Grid & AdaBoost Random: test performance generalizes well from the training data and accuracy is also good. Considering accuracy next to the recall metric, AdaBoost Random is the better model among these
### Feature Importance Using Sklearn
feature_names = X_test.columns
importances = adb_tuned2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
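The ranking step in the plot above relies on `np.argsort`, which returns indices that sort the importances in ascending order, so `barh` draws the most important feature at the top; a toy example of just that step:

```python
# np.argsort gives the index order that sorts importances ascending
import numpy as np

importances = np.array([0.1, 0.5, 0.4])
names = ["f0", "f1", "f2"]
indices = np.argsort(importances)
print([names[i] for i in indices])  # -> ['f0', 'f2', 'f1']
```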
# Separating target variable and other variables
pX = df.drop(["Attrition_Flag"], axis=1)
pY = df["Attrition_Flag"]
# Dropping the derived grouping columns before building the pipeline
pX.drop(["AgeGroup", "Months_on_book_Grp", "Credit_Limit_Grp"], axis=1, inplace=True)
# Identifying the category columns
category_columnNames = pX.describe(include=["category"]).columns
category_columnNames
Index(['Gender', 'Dependent_count', 'Education_Level', 'Marital_Status',
'Income_Category', 'Card_Category', 'Total_Relationship_Count',
'Months_Inactive_12_mon', 'Contacts_Count_12_mon'],
dtype='object')
# Identifying the numerical columns
number_columnNames = (
pX.describe(include=["int64"]).columns.tolist()
+ pX.describe(include=["float64"]).columns.tolist()
)
number_columnNames
['Customer_Age', 'Months_on_book', 'Total_Revolving_Bal', 'Total_Trans_Amt', 'Total_Trans_Ct', 'Credit_Limit', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']
# creating a transformer for numerical variables, which will apply simple imputer on the numerical variables
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
# creating a transformer for categorical variables, which will first apply simple imputer and
#then do one hot encoding for categorical variables
categorical_transformer = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="most_frequent")),
("onehot", OneHotEncoder(handle_unknown="ignore")),
]
)
# handle_unknown = "ignore", allows model to handle any unknown category in the test data
# combining categorical transformer and numerical transformer using a column transformer
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, number_columnNames),
("cat", categorical_transformer, category_columnNames),
],
remainder="passthrough",
)
# remainder = "passthrough" has been used, it will allow variables that are present in original data
# but not in "numerical_columns" and "categorical_columns" to pass through the column transformer without any changes
# Splitting the data into train and test sets
XX_train, XX_test, yy_train, yy_test = train_test_split(
pX, pY, test_size=0.30, random_state=1, stratify=pY
)
print(XX_train.shape, XX_test.shape)
(7088, 19) (3039, 19)
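The `stratify` argument keeps the target's class proportions the same in both splits, which matters for an imbalanced attrition flag; a toy check with an 80/20 label mix (illustrative data, not the bank frame):

```python
# Stratified splitting preserves the 20% positive rate in train and test
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(y_tr.mean(), y_te.mean())  # both 0.2
```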
# Creating new pipeline with best parameters
model = Pipeline(
steps=[
("pre", preprocessor),
(
"AB",
AdaBoostClassifier(
random_state=1,
n_estimators=10,
learning_rate=0.2,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
),
),
]
)
# Fit the model on training data
model.fit(XX_train, yy_train)
Pipeline(steps=[('pre',
ColumnTransformer(remainder='passthrough',
transformers=[('num',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='median'))]),
['Customer_Age',
'Months_on_book',
'Total_Revolving_Bal',
'Total_Trans_Amt',
'Total_Trans_Ct',
'Credit_Limit',
'Avg_Open_To_Buy',
'Total_Amt_Chng_Q4_Q1',
'Total_Ct_Chng_Q4_Q1',
'Avg_Utilization_Ratio']),
('cat',...
OneHotEncoder(handle_unknown='ignore'))]),
Index(['Gender', 'Dependent_count', 'Education_Level', 'Marital_Status',
'Income_Category', 'Card_Category', 'Total_Relationship_Count',
'Months_Inactive_12_mon', 'Contacts_Count_12_mon'],
dtype='object'))])),
('AB',
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.2, n_estimators=10,
random_state=1))])
XX_test["model_predictions"] = model.predict(XX_test)
XX_test[XX_test["model_predictions"] == 1]
| | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | model_predictions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2005 | 39 | M | 2 | Uneducated | Married | $120K + | Blue | 26 | 2 | 3 | 4 | 8906.000000 | 0 | 8906.000000 | 0.315 | 809 | 15 | 0.250 | 0.000 | 1 |
| 6543 | 38 | F | 3 | Graduate | Single | $40K - $60K | Blue | 26 | 3 | 2 | 3 | 2669.000000 | 0 | 2669.000000 | 0.670 | 2271 | 31 | 0.722 | 0.000 | 1 |
| 4483 | 49 | M | 5 | Uneducated | NaN | $60K - $80K | Blue | 43 | 3 | 3 | 3 | 1960.000000 | 0 | 1960.000000 | 0.493 | 2253 | 39 | 0.345 | 0.000 | 1 |
| 4983 | 45 | M | 2 | College | Single | $60K - $80K | Blue | 29 | 3 | 2 | 4 | 3841.000000 | 0 | 3841.000000 | 0.794 | 2832 | 48 | 0.846 | 0.000 | 1 |
| 1743 | 58 | F | 1 | Uneducated | Married | NaN | Blue | 36 | 5 | 2 | 3 | 4784.000000 | 0 | 4784.000000 | 0.905 | 2160 | 54 | 0.929 | 0.000 | 1 |
| 8550 | 56 | F | 3 | High School | NaN | Less than $40K | Blue | 44 | 4 | 3 | 2 | 1667.000000 | 595 | 1072.000000 | 0.630 | 2424 | 40 | 0.481 | 0.357 | 1 |
| 4108 | 49 | M | 1 | NaN | Single | $80K - $120K | Blue | 36 | 6 | 3 | 4 | 12830.000000 | 0 | 11606.000000 | 0.665 | 2512 | 45 | 0.324 | 0.000 | 1 |
| 10089 | 52 | F | 5 | NaN | Married | Less than $40K | Blue | 36 | 4 | 3 | 3 | 9611.000000 | 0 | 9611.000000 | 0.840 | 7636 | 64 | 0.829 | 0.000 | 1 |
| 3168 | 47 | M | 3 | Uneducated | Divorced | $120K + | Silver | 42 | 2 | 4 | 5 | 18442.000000 | 0 | 17117.000000 | 0.521 | 1641 | 35 | 0.591 | 0.000 | 1 |
| 8564 | 57 | M | 2 | NaN | Divorced | $120K + | Blue | 38 | 1 | 2 | 4 | 18442.000000 | 1116 | 17117.000000 | 0.620 | 2836 | 41 | 0.281 | 0.046 | 1 |
| 6891 | 55 | F | 1 | College | Single | Less than $40K | Blue | 48 | 3 | 5 | 3 | 2114.000000 | 546 | 1568.000000 | 0.619 | 2578 | 42 | 0.750 | 0.258 | 1 |
| 6790 | 39 | F | 2 | Graduate | Divorced | Less than $40K | Blue | 36 | 5 | 2 | 3 | 2092.000000 | 0 | 2092.000000 | 0.422 | 2015 | 33 | 0.269 | 0.000 | 1 |
| 8999 | 40 | M | 3 | Graduate | Single | $80K - $120K | Blue | 30 | 2 | 2 | 2 | 12830.000000 | 159 | 11606.000000 | 1.017 | 4983 | 40 | 0.379 | 0.005 | 1 |
| 2403 | 46 | M | 3 | College | Single | $80K - $120K | Blue | 39 | 1 | 4 | 3 | 4026.000000 | 243 | 3783.000000 | 0.738 | 1102 | 27 | 0.800 | 0.060 | 1 |
| 6971 | 47 | M | 2 | NaN | Married | $120K + | Blue | 37 | 6 | 4 | 2 | 11354.000000 | 0 | 11354.000000 | 1.007 | 3073 | 49 | 1.172 | 0.000 | 1 |
| 4302 | 50 | F | 1 | High School | Single | NaN | Blue | 42 | 5 | 0 | 4 | 10057.000000 | 0 | 10057.000000 | 0.792 | 2383 | 42 | 0.448 | 0.000 | 1 |
| 8431 | 52 | M | 2 | Graduate | Single | $60K - $80K | Blue | 45 | 2 | 3 | 3 | 3460.000000 | 0 | 3460.000000 | 0.466 | 2355 | 47 | 0.621 | 0.000 | 1 |
| 7413 | 50 | M | 1 | Post-Graduate | Single | $60K - $80K | Blue | 36 | 4 | 3 | 2 | 2317.000000 | 0 | 2317.000000 | 0.734 | 2214 | 41 | 0.519 | 0.000 | 1 |
| 7835 | 38 | F | 4 | Graduate | Married | $40K - $60K | Blue | 33 | 4 | 3 | 1 | 4047.000000 | 0 | 4047.000000 | 0.648 | 2134 | 34 | 0.478 | 0.000 | 1 |
| 7711 | 41 | M | 4 | Graduate | NaN | $80K - $120K | Blue | 31 | 3 | 3 | 2 | 19782.000000 | 868 | 18914.000000 | 0.808 | 2535 | 44 | 0.467 | 0.044 | 1 |
| 3171 | 41 | F | 3 | NaN | Single | $40K - $60K | Blue | 36 | 3 | 2 | 4 | 5317.000000 | 0 | 5317.000000 | 0.699 | 2003 | 29 | 0.318 | 0.000 | 1 |
| 4878 | 44 | F | 2 | Graduate | Divorced | Less than $40K | Blue | 36 | 3 | 2 | 3 | 1880.000000 | 0 | 1880.000000 | 0.519 | 2469 | 34 | 0.417 | 0.000 | 1 |
| 2793 | 52 | M | 2 | Doctorate | Married | $120K + | Blue | 34 | 3 | 2 | 3 | 11188.000000 | 0 | 11188.000000 | 0.658 | 2109 | 47 | 0.621 | 0.000 | 1 |
| 9769 | 41 | M | 4 | High School | Married | Less than $40K | Blue | 32 | 2 | 3 | 3 | 7769.000000 | 0 | 7769.000000 | 0.943 | 8109 | 74 | 0.762 | 0.000 | 1 |
| 7727 | 44 | F | 4 | Uneducated | Single | NaN | Blue | 36 | 3 | 3 | 4 | 8075.000000 | 317 | 7758.000000 | 0.585 | 2415 | 41 | 0.577 | 0.039 | 1 |
| 9209 | 52 | M | 0 | High School | Single | $60K - $80K | Blue | 39 | 4 | 3 | 2 | 3764.000000 | 0 | 3764.000000 | 0.776 | 6574 | 69 | 0.865 | 0.000 | 1 |
| 8013 | 43 | M | 3 | College | Divorced | $80K - $120K | Blue | 32 | 6 | 3 | 2 | 7278.000000 | 0 | 7278.000000 | 0.618 | 2120 | 43 | 0.593 | 0.000 | 1 |
| 8193 | 52 | M | 3 | Graduate | Single | $80K - $120K | Blue | 47 | 1 | 3 | 3 | 12830.000000 | 0 | 11606.000000 | 0.559 | 5472 | 70 | 0.591 | 0.000 | 1 |
| 7489 | 52 | F | 3 | Graduate | Single | Less than $40K | Blue | 38 | 2 | 3 | 4 | 2437.000000 | 0 | 2437.000000 | 0.912 | 3068 | 53 | 0.656 | 0.000 | 1 |
| 6380 | 50 | M | 3 | Uneducated | Single | $80K - $120K | Blue | 36 | 6 | 3 | 2 | 6925.000000 | 0 | 6925.000000 | 0.821 | 2506 | 36 | 0.440 | 0.000 | 1 |
| 10021 | 30 | F | 1 | Graduate | Married | NaN | Blue | 18 | 4 | 1 | 4 | 4377.000000 | 2517 | 1860.000000 | 0.941 | 8759 | 74 | 0.609 | 0.575 | 1 |
| 3935 | 58 | F | 1 | Uneducated | Divorced | NaN | Blue | 54 | 3 | 5 | 3 | 3266.000000 | 859 | 2407.000000 | 0.740 | 2151 | 46 | 0.438 | 0.263 | 1 |
| 5175 | 50 | F | 2 | Post-Graduate | Married | NaN | Blue | 36 | 5 | 2 | 4 | 4045.000000 | 0 | 4045.000000 | 0.660 | 2438 | 41 | 0.464 | 0.000 | 1 |
| 9158 | 58 | M | 2 | Uneducated | Single | $80K - $120K | Blue | 46 | 1 | 3 | 1 | 10286.000000 | 0 | 10286.000000 | 0.908 | 8199 | 59 | 0.903 | 0.000 | 1 |
| 2511 | 41 | M | 4 | Graduate | NaN | $60K - $80K | Blue | 36 | 2 | 3 | 5 | 1438.300000 | 312 | 1126.300000 | 0.657 | 1786 | 26 | 0.733 | 0.217 | 1 |
| 1295 | 60 | M | 1 | NaN | Divorced | $40K - $60K | Blue | 49 | 4 | 3 | 2 | 3012.000000 | 0 | 3012.000000 | 0.538 | 1315 | 23 | 0.278 | 0.000 | 1 |
| 626 | 55 | M | 3 | Doctorate | Single | $80K - $120K | Blue | 35 | 4 | 1 | 3 | 12830.000000 | 0 | 11606.000000 | 0.881 | 837 | 25 | 0.667 | 0.000 | 1 |
| 7165 | 47 | F | 3 | Graduate | Married | NaN | Blue | 36 | 3 | 3 | 1 | 5590.000000 | 0 | 5590.000000 | 0.289 | 1507 | 32 | 0.228 | 0.000 | 1 |
| 4900 | 55 | F | 4 | High School | Married | Less than $40K | Blue | 36 | 4 | 2 | 3 | 1477.000000 | 0 | 1477.000000 | 0.719 | 2419 | 49 | 0.531 | 0.000 | 1 |
| 8566 | 59 | M | 0 | Uneducated | NaN | $60K - $80K | Blue | 48 | 2 | 6 | 2 | 13172.000000 | 0 | 13172.000000 | 0.876 | 2598 | 47 | 0.621 | 0.000 | 1 |
| 8665 | 49 | F | 3 | High School | Married | NaN | Silver | 36 | 2 | 3 | 3 | 6830.221299 | 1273 | 5734.696416 | 0.977 | 4777 | 51 | 0.457 | 0.037 | 1 |
| 4064 | 46 | M | 3 | College | Married | $80K - $120K | Blue | 26 | 6 | 2 | 3 | 1438.300000 | 864 | 574.300000 | 0.766 | 2299 | 40 | 0.481 | 0.601 | 1 |
| 191 | 43 | M | 4 | Graduate | NaN | $80K - $120K | Blue | 27 | 5 | 2 | 0 | 12830.000000 | 0 | 11606.000000 | 0.731 | 1376 | 35 | 0.591 | 0.000 | 1 |
| 3204 | 38 | F | 1 | NaN | Married | Less than $40K | Blue | 20 | 2 | 2 | 3 | 1621.000000 | 580 | 1041.000000 | 0.421 | 1893 | 41 | 0.228 | 0.358 | 1 |
| 3491 | 51 | M | 3 | Graduate | Single | $80K - $120K | Blue | 43 | 4 | 2 | 3 | 10458.000000 | 0 | 10458.000000 | 0.587 | 3481 | 51 | 0.500 | 0.000 | 1 |
| 5064 | 56 | F | 1 | College | Single | Less than $40K | Blue | 43 | 3 | 2 | 2 | 2846.000000 | 0 | 2846.000000 | 0.747 | 2122 | 44 | 0.419 | 0.000 | 1 |
| 8315 | 45 | M | 3 | Uneducated | Single | $120K + | Blue | 35 | 2 | 3 | 5 | 7135.000000 | 2517 | 4618.000000 | 0.631 | 2305 | 42 | 0.448 | 0.353 | 1 |
| 9482 | 44 | M | 3 | Graduate | Divorced | $40K - $60K | Blue | 32 | 1 | 2 | 2 | 3575.000000 | 0 | 3575.000000 | 0.843 | 7672 | 66 | 0.737 | 0.000 | 1 |
| 9231 | 51 | M | 4 | Graduate | Single | $80K - $120K | Silver | 42 | 6 | 4 | 2 | 12830.000000 | 230 | 11606.000000 | 1.004 | 8629 | 65 | 0.548 | 0.007 | 1 |
| 4696 | 51 | M | 2 | High School | Married | $80K - $120K | Blue | 41 | 5 | 2 | 3 | 14902.000000 | 0 | 14902.000000 | 0.312 | 2038 | 39 | 0.625 | 0.000 | 1 |
| 4872 | 40 | M | 3 | High School | NaN | $80K - $120K | Blue | 34 | 1 | 3 | 4 | 12830.000000 | 0 | 11606.000000 | 0.399 | 2128 | 39 | 0.345 | 0.000 | 1 |
| 1357 | 62 | F | 0 | High School | Married | Less than $40K | Blue | 51 | 6 | 3 | 4 | 1438.300000 | 0 | 1438.300000 | 0.702 | 1445 | 38 | 0.310 | 0.000 | 1 |
| 880 | 56 | M | 0 | College | Married | $80K - $120K | Blue | 51 | 3 | 2 | 3 | 14501.000000 | 0 | 14501.000000 | 0.854 | 1031 | 25 | 0.389 | 0.000 | 1 |
| 5523 | 53 | F | 3 | High School | Married | Less than $40K | Blue | 41 | 3 | 3 | 2 | 1547.000000 | 0 | 1547.000000 | 0.845 | 2312 | 43 | 0.303 | 0.000 | 1 |
| 4603 | 49 | F | 3 | Post-Graduate | Married | Less than $40K | Blue | 35 | 3 | 3 | 2 | 1914.000000 | 0 | 1914.000000 | 0.677 | 2687 | 41 | 0.414 | 0.000 | 1 |
| 5384 | 32 | F | 0 | Graduate | Single | Less than $40K | Blue | 20 | 5 | 1 | 5 | 3983.000000 | 431 | 3552.000000 | 0.887 | 3098 | 46 | 0.438 | 0.108 | 1 |
| 5191 | 53 | F | 3 | High School | Divorced | NaN | Blue | 36 | 4 | 3 | 6 | 7939.000000 | 0 | 7939.000000 | 0.551 | 2269 | 42 | 0.312 | 0.000 | 1 |
| 4274 | 41 | M | 3 | NaN | Married | $80K - $120K | Blue | 22 | 3 | 3 | 4 | 12830.000000 | 0 | 11606.000000 | 0.547 | 1874 | 42 | 0.615 | 0.000 | 1 |
| 4774 | 36 | F | 1 | NaN | Single | Less than $40K | Blue | 24 | 2 | 1 | 2 | 1735.000000 | 0 | 1735.000000 | 0.740 | 2467 | 35 | 0.346 | 0.000 | 1 |
| 7811 | 38 | F | 2 | Graduate | NaN | Less than $40K | Blue | 36 | 2 | 3 | 2 | 2369.000000 | 0 | 2369.000000 | 0.510 | 1924 | 48 | 0.548 | 0.000 | 1 |
| 6928 | 54 | M | 2 | Uneducated | Married | $60K - $80K | Blue | 35 | 4 | 3 | 4 | 2189.000000 | 382 | 1807.000000 | 0.884 | 2750 | 51 | 0.594 | 0.175 | 1 |
| 4824 | 45 | F | 0 | NaN | Married | $40K - $60K | Blue | 30 | 3 | 2 | 4 | 2759.000000 | 0 | 2759.000000 | 0.535 | 2061 | 47 | 0.424 | 0.000 | 1 |
| 7203 | 42 | F | 4 | Uneducated | Married | Less than $40K | Blue | 23 | 3 | 2 | 3 | 3214.000000 | 0 | 3214.000000 | 0.435 | 2119 | 44 | 0.517 | 0.000 | 1 |
| 3710 | 38 | M | 2 | Graduate | Married | $60K - $80K | Blue | 36 | 2 | 3 | 2 | 1438.300000 | 0 | 1438.300000 | 0.461 | 1651 | 42 | 0.500 | 0.000 | 1 |
| 3717 | 49 | M | 1 | High School | NaN | $60K - $80K | Blue | 38 | 3 | 3 | 2 | 15898.000000 | 0 | 15898.000000 | 1.049 | 4184 | 54 | 1.077 | 0.000 | 1 |
| 3474 | 47 | M | 5 | High School | Married | $80K - $120K | Blue | 37 | 2 | 2 | 2 | 9410.000000 | 0 | 9410.000000 | 0.586 | 2178 | 41 | 0.864 | 0.000 | 1 |
| 5149 | 54 | F | 3 | College | Married | Less than $40K | Blue | 44 | 5 | 3 | 2 | 2921.000000 | 2412 | 509.000000 | 0.823 | 2612 | 33 | 0.375 | 0.826 | 1 |
| 7070 | 40 | M | 3 | Graduate | Married | $120K + | Blue | 27 | 5 | 3 | 2 | 2269.000000 | 0 | 2269.000000 | 0.572 | 2108 | 39 | 0.560 | 0.000 | 1 |
| 910 | 26 | M | 0 | Graduate | Single | NaN | Blue | 19 | 4 | 1 | 2 | 1438.300000 | 0 | 1438.300000 | 0.472 | 2005 | 47 | 0.469 | 0.000 | 1 |
| 4863 | 49 | F | 4 | Graduate | Married | Less than $40K | Blue | 38 | 5 | 2 | 4 | 1757.000000 | 0 | 1757.000000 | 0.890 | 2557 | 45 | 0.406 | 0.000 | 1 |
| 3694 | 38 | F | 2 | Graduate | Single | Less than $40K | Blue | 32 | 2 | 3 | 3 | 3775.000000 | 0 | 3775.000000 | 0.581 | 2031 | 36 | 0.385 | 0.000 | 1 |
| 8012 | 49 | F | 4 | Uneducated | Married | Less than $40K | Blue | 31 | 4 | 3 | 2 | 1567.000000 | 0 | 1567.000000 | 0.424 | 2138 | 41 | 0.577 | 0.000 | 1 |
| 8334 | 44 | F | 2 | Graduate | Single | $40K - $60K | Blue | 35 | 2 | 3 | 4 | 4196.000000 | 561 | 3635.000000 | 0.741 | 2558 | 38 | 0.652 | 0.134 | 1 |
| 9776 | 52 | M | 2 | Graduate | Divorced | $80K - $120K | Silver | 44 | 4 | 3 | 2 | 12830.000000 | 751 | 11606.000000 | 0.357 | 5806 | 40 | 0.429 | 0.022 | 1 |
| 5155 | 51 | F | 1 | NaN | Married | Less than $40K | Blue | 32 | 5 | 3 | 3 | 1438.300000 | 0 | 1438.300000 | 0.905 | 2585 | 45 | 0.500 | 0.000 | 1 |
| 6013 | 45 | M | 3 | Uneducated | Married | $80K - $120K | Blue | 38 | 3 | 3 | 3 | 9234.000000 | 0 | 9234.000000 | 0.629 | 2586 | 44 | 0.257 | 0.000 | 1 |
| 5444 | 59 | M | 0 | Post-Graduate | Married | $80K - $120K | Blue | 46 | 5 | 4 | 2 | 6526.000000 | 489 | 6037.000000 | 0.678 | 2331 | 45 | 0.452 | 0.075 | 1 |
| 1830 | 65 | F | 0 | Graduate | Divorced | Less than $40K | Blue | 56 | 3 | 6 | 3 | 6184.000000 | 0 | 6184.000000 | 1.016 | 1712 | 27 | 0.286 | 0.000 | 1 |
| 8123 | 38 | F | 2 | High School | Single | Less than $40K | Blue | 16 | 2 | 1 | 2 | 2703.000000 | 0 | 2703.000000 | 0.882 | 2752 | 43 | 0.433 | 0.000 | 1 |
| 3398 | 43 | M | 3 | High School | Single | $60K - $80K | Blue | 38 | 2 | 3 | 2 | 12254.000000 | 0 | 12254.000000 | 0.705 | 2129 | 42 | 0.355 | 0.000 | 1 |
| 1892 | 26 | F | 1 | College | Single | Less than $40K | Blue | 15 | 5 | 2 | 4 | 1438.300000 | 737 | 701.300000 | 0.806 | 2856 | 39 | 0.500 | 0.512 | 1 |
| 1448 | 63 | F | 0 | Uneducated | Married | NaN | Blue | 54 | 5 | 3 | 2 | 11827.000000 | 0 | 11827.000000 | 0.921 | 1395 | 37 | 0.609 | 0.000 | 1 |
| 2098 | 54 | F | 2 | Post-Graduate | Married | Less than $40K | Blue | 47 | 3 | 4 | 3 | 1438.300000 | 0 | 1438.300000 | 1.053 | 1154 | 22 | 0.375 | 0.000 | 1 |
| 7317 | 56 | F | 2 | Uneducated | Single | Less than $40K | Blue | 36 | 1 | 3 | 5 | 1883.000000 | 400 | 1483.000000 | 0.528 | 1954 | 45 | 0.500 | 0.212 | 1 |
| 9333 | 47 | M | 4 | NaN | NaN | $80K - $120K | Blue | 42 | 1 | 4 | 1 | 7953.000000 | 0 | 7953.000000 | 1.031 | 8670 | 74 | 0.574 | 0.000 | 1 |
| 306 | 36 | F | 3 | High School | Married | NaN | Blue | 24 | 4 | 1 | 1 | 15439.000000 | 0 | 15439.000000 | 0.742 | 2069 | 43 | 0.536 | 0.000 | 1 |
| 7277 | 37 | F | 3 | High School | Single | Less than $40K | Blue | 29 | 3 | 3 | 3 | 1653.000000 | 0 | 1653.000000 | 0.517 | 2284 | 35 | 0.296 | 0.000 | 1 |
| 9680 | 35 | F | 0 | Doctorate | Married | Less than $40K | Blue | 27 | 2 | 3 | 2 | 3876.000000 | 0 | 3876.000000 | 0.848 | 7712 | 73 | 0.825 | 0.000 | 1 |
| 6113 | 35 | F | 1 | NaN | Divorced | NaN | Blue | 24 | 5 | 3 | 3 | 1438.300000 | 0 | 1438.300000 | 0.849 | 2355 | 47 | 0.567 | 0.000 | 1 |
... (several hundred additional customer rows omitted from this printout; the full table is available in the source dataset) ...
| 6927 | 55 | F | 3 | NaN | Single | Less than $40K | Blue | 45 | 4 | 3 | 4 | 1741.000000 | 0 | 1741.000000 | 0.840 | 2546 | 42 | 0.400 | 0.000 | 1 |
| 3810 | 49 | M | 3 | High School | Married | $60K - $80K | Blue | 34 | 6 | 3 | 3 | 1511.000000 | 0 | 1511.000000 | 0.432 | 1329 | 35 | 0.591 | 0.000 | 1 |
| 4486 | 47 | M | 2 | Graduate | Married | $40K - $60K | Blue | 42 | 6 | 3 | 3 | 4738.000000 | 0 | 4738.000000 | 0.641 | 1900 | 36 | 0.500 | 0.000 | 1 |
| 1878 | 45 | M | 2 | College | Divorced | $80K - $120K | Blue | 39 | 3 | 4 | 4 | 12830.000000 | 0 | 11606.000000 | 1.014 | 1041 | 27 | 0.350 | 0.000 | 1 |
| 6907 | 62 | M | 0 | Post-Graduate | Divorced | $60K - $80K | Silver | 46 | 4 | 5 | 6 | 7660.000000 | 287 | 6418.500000 | 0.594 | 2281 | 45 | 0.552 | 0.010 | 1 |
| 1200 | 39 | F | 3 | High School | Single | Less than $40K | Blue | 36 | 6 | 1 | 3 | 3651.000000 | 0 | 3651.000000 | 0.977 | 862 | 19 | 1.111 | 0.000 | 1 |
| 7803 | 58 | F | 3 | NaN | Divorced | Less than $40K | Blue | 36 | 2 | 2 | 3 | 1508.000000 | 0 | 1508.000000 | 0.458 | 1990 | 31 | 0.228 | 0.000 | 1 |
| 537 | 45 | M | 2 | Graduate | Married | $60K - $80K | Blue | 25 | 3 | 5 | 3 | 1438.300000 | 0 | 1438.300000 | 0.676 | 709 | 27 | 0.421 | 0.000 | 1 |
| 717 | 43 | M | 3 | NaN | Married | $40K - $60K | Blue | 35 | 2 | 2 | 3 | 1438.300000 | 0 | 1438.300000 | 0.694 | 891 | 28 | 0.400 | 0.000 | 1 |
| 9972 | 52 | M | 3 | NaN | Single | $80K - $120K | Blue | 44 | 1 | 2 | 4 | 3526.000000 | 2429 | 1097.000000 | 0.712 | 7775 | 63 | 0.909 | 0.689 | 1 |
| 5300 | 53 | F | 0 | High School | Married | Less than $40K | Blue | 41 | 3 | 3 | 4 | 1438.300000 | 0 | 1438.300000 | 0.387 | 1587 | 42 | 0.448 | 0.000 | 1 |
| 1186 | 64 | F | 1 | High School | Married | $40K - $60K | Blue | 56 | 4 | 2 | 4 | 1438.300000 | 0 | 1438.300000 | 0.693 | 1417 | 35 | 0.591 | 0.000 | 1 |
| 9808 | 34 | M | 0 | Graduate | Divorced | $80K - $120K | Silver | 24 | 1 | 2 | 3 | 12830.000000 | 400 | 11606.000000 | 0.289 | 5112 | 49 | 0.256 | 0.012 | 1 |
| 1269 | 39 | F | 3 | Graduate | Married | Less than $40K | Blue | 32 | 5 | 6 | 3 | 3221.000000 | 0 | 3221.000000 | 0.678 | 1765 | 40 | 0.600 | 0.000 | 1 |
| 9435 | 52 | F | 4 | Uneducated | Divorced | NaN | Blue | 42 | 4 | 4 | 1 | 12091.000000 | 0 | 12091.000000 | 0.898 | 8127 | 74 | 0.805 | 0.000 | 1 |
| 5313 | 36 | F | 2 | High School | Married | $40K - $60K | Blue | 36 | 4 | 3 | 2 | 1848.000000 | 0 | 1848.000000 | 0.906 | 2775 | 33 | 0.435 | 0.000 | 1 |
| 5874 | 42 | F | 4 | Doctorate | Married | Less than $40K | Blue | 24 | 3 | 2 | 3 | 4304.000000 | 0 | 4304.000000 | 0.289 | 1788 | 36 | 0.228 | 0.000 | 1 |
| 4267 | 47 | F | 4 | Graduate | Divorced | Less than $40K | Blue | 42 | 1 | 3 | 3 | 5798.000000 | 2517 | 3281.000000 | 0.580 | 1860 | 43 | 0.433 | 0.434 | 1 |
| 9627 | 43 | M | 2 | Post-Graduate | Single | $60K - $80K | Blue | 36 | 2 | 3 | 3 | 17306.000000 | 0 | 17306.000000 | 0.400 | 6205 | 65 | 0.806 | 0.000 | 1 |
| 6522 | 42 | F | 4 | Graduate | Single | Less than $40K | Blue | 36 | 6 | 2 | 2 | 2803.000000 | 0 | 2803.000000 | 0.623 | 2216 | 40 | 0.600 | 0.000 | 1 |
| 4730 | 60 | F | 1 | NaN | Single | $40K - $60K | Blue | 41 | 3 | 2 | 4 | 2425.000000 | 0 | 2425.000000 | 0.289 | 1522 | 36 | 0.228 | 0.000 | 1 |
| 7331 | 42 | M | 5 | High School | Married | $60K - $80K | Blue | 36 | 1 | 2 | 3 | 1866.000000 | 0 | 1866.000000 | 0.798 | 2833 | 42 | 0.355 | 0.000 | 1 |
| 9860 | 45 | F | 2 | Graduate | Married | $40K - $60K | Blue | 26 | 6 | 2 | 4 | 4307.000000 | 0 | 4307.000000 | 0.743 | 8697 | 62 | 0.590 | 0.000 | 1 |
| 4762 | 55 | F | 2 | Graduate | Single | Less than $40K | Blue | 49 | 2 | 1 | 2 | 1583.000000 | 234 | 1349.000000 | 0.639 | 2541 | 36 | 0.385 | 0.148 | 1 |
| 4087 | 46 | M | 2 | College | Single | $80K - $120K | Blue | 38 | 2 | 3 | 2 | 10309.000000 | 0 | 10309.000000 | 0.451 | 1760 | 40 | 0.600 | 0.000 | 1 |
| 1967 | 43 | F | 3 | Doctorate | Married | NaN | Blue | 38 | 3 | 4 | 2 | 1947.000000 | 0 | 1947.000000 | 0.838 | 2119 | 52 | 1.000 | 0.000 | 1 |
| 9951 | 44 | F | 3 | NaN | Single | NaN | Blue | 34 | 2 | 3 | 3 | 6830.221299 | 0 | 5734.696416 | 1.040 | 8898 | 60 | 0.538 | 0.000 | 1 |
| 3653 | 55 | F | 2 | Graduate | Married | $40K - $60K | Blue | 49 | 3 | 3 | 4 | 1809.000000 | 0 | 1809.000000 | 0.569 | 2123 | 44 | 0.571 | 0.000 | 1 |
| 1953 | 47 | F | 3 | Graduate | Single | Less than $40K | Blue | 36 | 5 | 3 | 4 | 7246.000000 | 0 | 7246.000000 | 0.612 | 777 | 13 | 0.625 | 0.000 | 1 |
| 4957 | 54 | F | 2 | Doctorate | Married | Less than $40K | Blue | 41 | 2 | 2 | 3 | 3881.000000 | 2517 | 1364.000000 | 0.815 | 2463 | 38 | 0.407 | 0.649 | 1 |
| 734 | 43 | F | 4 | College | Married | Less than $40K | Blue | 23 | 6 | 2 | 3 | 7706.000000 | 392 | 7314.000000 | 0.764 | 965 | 27 | 0.421 | 0.051 | 1 |
| 9758 | 39 | M | 1 | Graduate | Married | $80K - $120K | Blue | 36 | 2 | 2 | 3 | 12830.000000 | 0 | 11606.000000 | 0.940 | 8549 | 73 | 0.780 | 0.000 | 1 |
| 3631 | 45 | M | 3 | NaN | Single | $60K - $80K | Blue | 35 | 6 | 2 | 1 | 18004.000000 | 2517 | 15487.000000 | 0.569 | 2030 | 45 | 0.452 | 0.140 | 1 |
| 5225 | 55 | F | 1 | Doctorate | Married | NaN | Blue | 39 | 1 | 3 | 3 | 12972.000000 | 1424 | 11548.000000 | 0.525 | 2238 | 36 | 0.228 | 0.110 | 1 |
| 8554 | 44 | F | 4 | Graduate | Married | Less than $40K | Blue | 36 | 6 | 3 | 3 | 1905.000000 | 0 | 1905.000000 | 0.845 | 3120 | 46 | 0.586 | 0.000 | 1 |
| 4358 | 46 | M | 1 | Graduate | Single | $60K - $80K | Blue | 37 | 2 | 3 | 4 | 2445.000000 | 848 | 1597.000000 | 0.771 | 2216 | 39 | 0.560 | 0.347 | 1 |
| 7915 | 32 | F | 0 | NaN | Divorced | Less than $40K | Blue | 36 | 6 | 3 | 3 | 1732.000000 | 0 | 1732.000000 | 0.683 | 2479 | 43 | 0.536 | 0.000 | 1 |
| 4033 | 65 | M | 0 | Doctorate | Single | Less than $40K | Blue | 52 | 3 | 3 | 4 | 3675.000000 | 0 | 3675.000000 | 0.664 | 3798 | 53 | 0.472 | 0.000 | 1 |
| 7856 | 59 | M | 1 | College | Single | $120K + | Blue | 47 | 2 | 4 | 2 | 4789.000000 | 357 | 4432.000000 | 0.798 | 2496 | 41 | 0.640 | 0.075 | 1 |
| 10098 | 55 | M | 3 | Graduate | Single | $120K + | Silver | 36 | 4 | 3 | 4 | 18442.000000 | 0 | 17117.000000 | 1.007 | 9931 | 70 | 0.750 | 0.000 | 1 |
| 3940 | 39 | F | 0 | NaN | Married | $40K - $60K | Silver | 34 | 3 | 0 | 3 | 15142.000000 | 0 | 15142.000000 | 0.761 | 2458 | 41 | 0.577 | 0.000 | 1 |
| 467 | 43 | F | 2 | Uneducated | Single | Less than $40K | Blue | 24 | 2 | 3 | 2 | 2962.000000 | 2517 | 445.000000 | 1.046 | 929 | 31 | 0.409 | 0.850 | 1 |
| 3655 | 48 | M | 4 | NaN | Married | $80K - $120K | Blue | 36 | 2 | 3 | 2 | 12830.000000 | 0 | 11606.000000 | 0.574 | 2045 | 45 | 0.500 | 0.000 | 1 |
| 6952 | 34 | F | 1 | Uneducated | NaN | Less than $40K | Blue | 25 | 1 | 2 | 3 | 3074.000000 | 0 | 3074.000000 | 0.718 | 2721 | 54 | 0.688 | 0.000 | 1 |
| 4898 | 30 | F | 1 | Graduate | Married | Less than $40K | Blue | 36 | 2 | 2 | 3 | 2905.000000 | 2517 | 388.000000 | 0.725 | 2487 | 39 | 0.393 | 0.866 | 1 |
| 4811 | 59 | F | 2 | NaN | Single | NaN | Blue | 45 | 1 | 3 | 3 | 2721.000000 | 1885 | 836.000000 | 0.853 | 2594 | 48 | 0.455 | 0.693 | 1 |
| 580 | 56 | M | 1 | College | Married | $60K - $80K | Blue | 44 | 5 | 3 | 3 | 1704.000000 | 0 | 1704.000000 | 0.701 | 660 | 17 | 0.308 | 0.000 | 1 |
| 1383 | 27 | M | 0 | NaN | Single | $40K - $60K | Blue | 17 | 5 | 1 | 2 | 4610.000000 | 0 | 4610.000000 | 0.794 | 2280 | 49 | 0.400 | 0.000 | 1 |
| 9787 | 45 | M | 5 | College | Single | $40K - $60K | Blue | 36 | 4 | 3 | 3 | 4982.000000 | 0 | 4982.000000 | 0.886 | 8586 | 58 | 1.000 | 0.000 | 1 |
Based on the customer information:
Based on the customers' attrition data, we found the following insights, which can serve as recommendations for understanding which customers churn and why:
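One common way to surface such insights is to compute the attrition rate within each level of a categorical feature. A minimal sketch of this idea is shown below on a tiny hand-made sample that mirrors the table above; the column names (`Income_Category`, `Attrition_Flag`, with 1 = attrited) are assumptions for illustration, not taken verbatim from the notebook's DataFrame.

```python
import pandas as pd

# Hypothetical mini-sample mirroring the structure of the table above.
# Column names and the 0/1 encoding of attrition are assumptions.
data = pd.DataFrame(
    {
        "Income_Category": [
            "Less than $40K", "Less than $40K", "$80K - $120K",
            "$40K - $60K", "$80K - $120K", "Less than $40K",
        ],
        "Attrition_Flag": [1, 1, 0, 0, 1, 0],  # 1 = attrited, 0 = existing
    }
)

# Attrition rate per income bracket: the mean of a 0/1 flag within each
# group is exactly the share of attrited customers in that group.
attrition_by_income = (
    data.groupby("Income_Category")["Attrition_Flag"]
    .mean()
    .sort_values(ascending=False)
)
print(attrition_by_income)
```

Sorting the resulting rates in descending order makes the highest-churn segments easy to read off; the same `groupby(...).mean()` pattern applies to any categorical column such as education level, marital status, or card category.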